
annieslasso's Introduction

The Cannon


Authors

  • Andy Casey (Cambridge) (Monash)
  • David W. Hogg (NYU) (MPIA) (SCDA)
  • Melissa K. Ness (MPIA)
  • Hans-Walter Rix (MPIA)
  • Anna Y. Q. Ho (Caltech)
  • Gerry Gilmore (Cambridge)

Installation

pip install https://github.com/andycasey/AnniesLasso/archive/master.zip

Getting Started

Let us assume that you have rest-frame continuum-normalized spectra for a set of stars for which the stellar parameters and chemical abundances (which we will collectively call labels) are known with high fidelity. The labels for those stars (and the locations of the spectrum fluxes and inverse variances) are assumed to be stored in a table. In this example all stars are assumed to be sampled on the same wavelength (dispersion) scale.

Here we will create and train a 3-label (effective temperature, surface gravity, metallicity) quadratic (e.g., Teff^2) model:

import numpy as np
from astropy.table import Table

import AnniesLasso as tc

# Load the table containing the training set labels, and the spectra.
training_set = Table.read("training_set_labels.fits")

# Here we will assume that the flux and inverse variance arrays are stored in
# different ASCII files. The end goal is just to produce flux and inverse
# variance arrays of shape (N_stars, N_pixels).
normalized_flux = np.array([np.loadtxt(star["flux_filename"]) for star in training_set])
normalized_ivar = np.array([np.loadtxt(star["ivar_filename"]) for star in training_set])

# Providing the dispersion to the model is optional, but handy later on.
dispersion = np.loadtxt("common_wavelengths.txt")

# Create a vectorizer that defines our model form.
vectorizer = tc.vectorizer.PolynomialVectorizer(("TEFF", "LOGG", "FEH"), 2)

# Create the model that will run in parallel using all available cores.
model = tc.CannonModel(training_set, normalized_flux, normalized_ivar,
                       vectorizer=vectorizer, dispersion=dispersion, threads=-1)

# Train the model!
model.train()
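
To see what a 3-label quadratic model form implies, here is a small sketch that enumerates the polynomial terms by hand. This is only an illustration of the model form; it is not how PolynomialVectorizer represents terms internally.

from itertools import combinations_with_replacement

labels = ("TEFF", "LOGG", "FEH")

# All polynomial terms up to second order, plus the constant term.
terms = ["1"]
for order in (1, 2):
    for combo in combinations_with_replacement(labels, order):
        terms.append(" * ".join(combo))

print(terms)
# ['1', 'TEFF', 'LOGG', 'FEH', 'TEFF * TEFF', 'TEFF * LOGG', 'TEFF * FEH',
#  'LOGG * LOGG', 'LOGG * FEH', 'FEH * FEH']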

You can follow this example further in the complete Getting Started tutorial.

License

Copyright 2017 the authors. The code in this repository is released under the open-source MIT License. See the file LICENSE for more details.

annieslasso's People

Contributors

  • andycasey
  • astrowizicist
  • davidwhogg
  • mkness


annieslasso's Issues

Make a comparison of duplicate observations

Sanders informs me that there are stars with the same APOGEE_ID but different ASPCAP_ID values. We stacked and analysed everything based on ASPCAP_ID. That means we can identify duplicates of APOGEE_ID values and also use them to estimate internal precision (and covariances, as Sanders is doing for distance determination).
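
A minimal sketch of how such a comparison could start, assuming a per-ASPCAP_ID results table with an APOGEE_ID column and per-star label columns (the filename and column names are assumptions):

import numpy as np
from astropy.table import Table

results = Table.read("cannon_results.fits")  # hypothetical results table

apogee_ids, counts = np.unique(results["APOGEE_ID"], return_counts=True)
duplicated = apogee_ids[counts > 1]

# Internal precision estimate: scatter among duplicates for each label.
for label in ("TEFF", "LOGG", "FEH"):
    scatter = [np.std(results[label][results["APOGEE_ID"] == apogee_id])
               for apogee_id in duplicated]
    print(label, np.median(scatter))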

Using GitHub Pages for the landing page

It would be ideal if we could make use of GitHub pages for our landing page at TheCannon.io. That way all the page content would be stored in the gh-pages branch of this repository.

Currently the TheCannon.io domain just points to the Read the Docs page.

Only one custom domain can be used with GH pages per user or organization. I already have a GitHub Pages site set up for my personal website, which means the GH page URL redirects from https://andycasey.github.io/AnniesLasso/index.html to astrowizici.st/AnniesLasso/index.html.

That means we need to create an organization or username, then specify TheCannon.io as our custom domain.

So, what organization name should we use, @davidwhogg? Once selected, we will have to move this repository to that new organization.

Check that l_bfgs_b output doesn't depend on convergence parameters

Here's the test I want:

  • Choose a handful of typical pixels.
  • Fit with the current defaults.
  • Loosen pgtol by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten pgtol.
  • Return pgtol to normal, loosen factr by a factor of 10, and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten factr.

This PR can be closed when both of these tests pass.

Bonus points: You can also loosen these parameters; we really want them as loose as possible such that the above tests pass.
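
A minimal sketch of the proposed test on a toy objective, using scipy's fmin_l_bfgs_b; the real per-pixel objective from the code base would be substituted for rosen here:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b, rosen, rosen_der

x0 = np.zeros(5)

# Current defaults (pgtol=1e-5, factr=1e7) versus each one loosened by 10x.
x_default, f_default, info = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der)
x_loose_pgtol, _, _ = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der, pgtol=1e-4)
x_loose_factr, _, _ = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der, factr=1e8)

# If either of these exceeds 1e-7, the corresponding tolerance must be tightened.
print(np.max(np.abs(x_loose_pgtol - x_default)))
print(np.max(np.abs(x_loose_factr - x_default)))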

give Hogg the plots he needs for the APOGEE meeting

Most of what he needs will be satisfied by #36, but he also wants something that shows the non-zero parameters as a function of Lambda, and something that shows 2-d projections of the 17-d space for the training set, and for a much larger set of stars (if we can!).
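
A minimal sketch of the non-zero-parameters figure, assuming a trained theta array (pixels by terms) is available for each regularization strength; the thetas dictionary below is a placeholder for real training output:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder: theta arrays of shape (N_pixels, N_terms), one per Lambda value.
# (8575 pixels and 171 terms correspond to APOGEE spectra and a 17-label
# quadratic model; replace with real trained coefficients.)
Lambdas = np.logspace(0, 6, 13)
thetas = {L: np.random.randn(8575, 171) for L in Lambdas}

n_nonzero = [np.sum(np.abs(thetas[L]) > 1e-12) for L in Lambdas]

plt.semilogx(Lambdas, n_nonzero, "o-")
plt.xlabel("Lambda")
plt.ylabel("number of non-zero coefficients")
plt.savefig("nonzero_vs_lambda.png")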

Program hangs at fitting stage in parallel

If a CannonModel (or sub-class) is instantiated with a multiprocessing pool, there are situations where the code will hang when trying to fit spectra. This situation is irrecoverable; the program will never finish. However, it will work perfectly fine in serial.

This situation arises because of the Accelerate framework on OS X, and affects both Python 2 and Python 3. Specifically, np.dot does not work in parallel under certain conditions (e.g., here).

In Python 2 the workaround is to compile against a different BLAS (see numpy/numpy#4776).

In Python 3 you can also resolve this situation by just using multiprocessing in 'spawn' mode. The way to do this (before importing The Cannon) is to use:

import multiprocessing as mp
mp.set_start_method('spawn')
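
One caveat: set_start_method should be called once, early, and (in a script) under an import guard, otherwise spawned workers re-importing the main module can hit a "context has already been set" RuntimeError. A minimal sketch:

import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

    import AnniesLasso as tc
    # ... build the CannonModel and fit spectra as usual from here.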

Since this problem is difficult to diagnose, I've opened this issue in case someone comes searching for these symptoms. However, this issue can't be fixed by the numpy group or by us, so I'm immediately closing it. (It also encourages me to drop Python 2.7 support and enforce the 'spawn' method for Python 3.)

Should we be adding higher powers of the labels?

For example, can we craft the right question to tell us whether it is worthwhile including Teff^4, or say log(Teff), etc.? Regularization can help, but these questions need to be thought out properly so we don't get misled by cross-terms.

how to choose one, uniform regularization Lambda

DUDE. If we are going to pick one overall Lambda, appropriate for all pixels, and use the same Lambda everywhere, we should use the Lambda that minimizes the SUM of the one-d plots we are making for all the pixels (that we can afford to test).
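
A minimal sketch of that selection, assuming the one-d curves are collected into an array of per-pixel validation metrics evaluated on a common grid of Lambda values (the metric array below is a placeholder):

import numpy as np

# Placeholder: metric[i, j] is the validation metric for pixel i at Lambdas[j];
# replace with the real one-d curves for the pixels we can afford to test.
Lambdas = np.logspace(0, 6, 25)
metric = np.random.rand(1000, Lambdas.size)

total = np.sum(metric, axis=0)           # sum over all tested pixels
best_Lambda = Lambdas[np.argmin(total)]  # the single Lambda used everywhere
print(best_Lambda)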

Bounding constraints on `theta` when solving.

If the vectorizer is scaled, do we have good reason to specify boundary conditions on theta parameters when simultaneously optimizing them with the scatter term? (None of this will matter if analytic derivatives are included, but it is worth pondering)
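
For reference, a minimal sketch of what box constraints would look like, using scipy's L-BFGS-B on a toy per-pixel objective; the objective, data, and bound values below are placeholders, not the code base's actual training step:

import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)

# Placeholder data: a tiny design matrix, fluxes, and inverse variances.
n_stars, n_terms = 50, 4
A = rng.normal(size=(n_stars, n_terms))
flux = A @ rng.normal(size=n_terms) + rng.normal(scale=0.05, size=n_stars)
ivar = np.full(n_stars, 1.0 / 0.05**2)

def objective(parameters, A, flux, ivar):
    # Sketch of a per-pixel objective: theta coefficients plus a log-scatter term.
    theta, ln_s = parameters[:-1], parameters[-1]
    adjusted_ivar = ivar / (1.0 + ivar * np.exp(2 * ln_s))
    residual = flux - A @ theta
    return np.sum(adjusted_ivar * residual**2 - np.log(adjusted_ivar))

# Box constraints on the (scaled) theta values and on ln(s); the ranges are guesses.
x0 = np.zeros(n_terms + 1)
bounds = [(-10, 10)] * n_terms + [(-10, 1)]
result = optimize.minimize(objective, x0, args=(A, flux, ivar),
                           method="L-BFGS-B", bounds=bounds)
print(result.x)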

Continuum

Re-normalize continuum for all stars (training and test), according to the Ness-ish method.

Jackknife the training set

This is trivial to do and would tell us about the (scale of the) uncertainty in the labels that results from having a finite sample size.
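
A minimal sketch of the procedure, assuming a helper that trains on a subset of the training set and returns test-step labels for the held-out star; train_and_test below is a placeholder, not an existing function in the code base, and training_set is the labels table from the Getting Started example:

import numpy as np

def train_and_test(train_indices, test_indices):
    # Placeholder: train a CannonModel on the train_indices stars and return
    # the inferred labels, shape (len(test_indices), n_labels), for test_indices.
    raise NotImplementedError

label_names = ("TEFF", "LOGG", "FEH")
known_labels = np.vstack([training_set[name] for name in label_names]).T

n_stars = len(training_set)
loo_labels = np.zeros_like(known_labels, dtype=float)
for i in range(n_stars):
    train_indices = np.delete(np.arange(n_stars), i)
    loo_labels[i] = train_and_test(train_indices, [i])[0]

# Scatter between leave-one-out inferred labels and the training labels.
print(np.std(loo_labels - known_labels, axis=0))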

This could be its own project if one wanted to look into the uncertainties of the training set labels, etc.

Should we support Python 2.x?

There are two reasons I've encountered so far that make me think we should not:

  1. The CannonModel class uses pickle to save and load objects. Python 3 has a newer pickling protocol for saving Python objects, which is incompatible with previous versions. For that reason, any model saved in Python 3 using the highest available protocol (which is the current default in our code base, for many reasons) will not be loadable in Python 2.x. Someone would have to load it in Python 3, then save it again using a lower protocol. The newer protocol has faster read/write times and much smaller file sizes.
  2. See #40, although that is not strictly a 'Python 2' problem but more of a 'BLAS' problem which cannot be fixed by pure-Python code in Python 2.x.
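
For reference, a minimal sketch of the re-save workaround described in point 1, assuming the saved model is a plain pickle file (the filenames are placeholders):

# Run this from Python 3 to make an existing model file loadable from Python 2.
import pickle

with open("model_py3.pickle", "rb") as fp:
    contents = pickle.load(fp)

with open("model_py2.pickle", "wb") as fp:
    pickle.dump(contents, fp, protocol=2)  # highest protocol Python 2 can read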

@davidwhogg, @mkness: comments welcome!

coefficients/theta and pixel variance

this is a nomenclature question for @davidwhogg:

when trained, a CannonModel currently has attributes model.coefficients and model.scatter.

I'm refactoring the code to calculate the inverse variances and use them in the right way, and in doing so I actually solve for s^2. Should model.coefficients and model.scatter be renamed, and if so, to what?
model.theta and model.pixel_variance? something else?

make issues from things arising at APOGEE meeting

  • using The Cannon to measure radial velocities for RRL and other stars
  • using The Cannon to identify spectroscopic binaries automatically
  • comparing The Cannon single-pixel internals to ASPCAP for the same
  • using the LSF (or LSF kernels) at test time
  • determining rotation, microturbulence, macroturbulence, and radial-velocity shifts at test time
  • outputting The Cannon spectral prediction for every APOGEE spectrum along with labels
  • producing DR13-trained output for internal use by the APOGEE team

almost every pixel is masked for many training-set stars!

Here is an example: 2M06123730+4036001

...but there are many more. What is going on? How can ASPCAP give us "good" labels for a star in which almost no pixel is good in any sub-visit? Literally not one good measurement at any wavelength.
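
A quick way to quantify how widespread this is, assuming masked pixels carry zero inverse variance and using the normalized_ivar array from the Getting Started example:

import numpy as np

# Fraction of masked (zero inverse variance) pixels for each training-set star.
masked_fraction = np.mean(normalized_ivar == 0, axis=1)

# Stars for which essentially every pixel is masked.
bad = np.where(masked_fraction > 0.99)[0]
print(len(bad), "stars have more than 99% of their pixels masked")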

re-starts at test time

Right now we do a single optimization at test time. This can't be right, since the optimization is non-convex. Also, Anna Ho finds that for LAMOST there are wrong answers caused by local minima. We should define a set of K initializations and start from all K, and choose the best answer (best in a likelihood sense). This, unfortunately, is high priority...
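
A minimal sketch of the K-restart scheme; neg_log_likelihood and the initializations below are placeholders standing in for the real test-step objective and for a sensible spread over label space:

import numpy as np
from scipy import optimize

def neg_log_likelihood(labels):
    # Placeholder for the per-star test-step objective (chi^2 given the model).
    scales = np.array([100.0, 0.1, 0.01])
    return np.sum(((labels - np.array([4500.0, 2.5, -0.3])) / scales)**2)

# K initializations spread over the label space (values are illustrative).
initializations = [
    np.array([4000.0, 1.0, -1.0]),
    np.array([5000.0, 3.0, 0.0]),
    np.array([6000.0, 4.5, 0.3]),
]

results = [optimize.minimize(neg_log_likelihood, x0, method="Nelder-Mead")
           for x0 in initializations]

# Keep the best answer in a likelihood sense (lowest negative log-likelihood).
best = min(results, key=lambda r: r.fun)
print(best.x)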

explain to Hogg the fork switch

@andycasey how did you switch things so that it went from your repo being a fork of mine to mine being a fork of yours? Was that a "setting" change in both repos? Or what? It seems like it must be a GitHub change (not a git change). Anyway, both I and @dfm want to know.

no polynomials!

We will fit continuum with sums of sines and cosines, not polynomials, for very important reasons!
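
A minimal sketch of fitting such a basis by weighted linear least squares; the number of harmonics and the period are placeholders to be tuned:

import numpy as np

def sine_cosine_design_matrix(wavelengths, n_harmonics, period):
    # Columns: 1, sin(2 pi k x / P), cos(2 pi k x / P) for k = 1..n_harmonics.
    columns = [np.ones_like(wavelengths)]
    for k in range(1, n_harmonics + 1):
        phase = 2 * np.pi * k * wavelengths / period
        columns.extend([np.sin(phase), np.cos(phase)])
    return np.vstack(columns).T

def fit_continuum(wavelengths, flux, ivar, n_harmonics=3, period=1400.0):
    A = sine_cosine_design_matrix(wavelengths, n_harmonics, period)
    w = np.sqrt(ivar)  # weighted least squares: scale rows by sqrt(ivar)
    coefficients, *_ = np.linalg.lstsq(A * w[:, None], flux * w, rcond=None)
    return A @ coefficients

# continuum = fit_continuum(dispersion, flux, ivar)  # per-star flux and ivar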

synchronize github with SDSS svn

we need to have the code duplicated within SDSS to become an official part of DR14. Timescale for completion: mid-2016.

Also, it would be cool to make this very easy to maintain and update.

Optimization issue when training regularized Cannon models

Training runs suggest that the optimization is not converging during the training phase for regularized models with high Lambda values.

The problem appears to be unrelated to ftol or xtol: the behaviour occurs with the default ftol and xtol, and persists for a single line even when the tolerance is decreased (to 3e-5).

I think this bug is fixed now: the same line in a 17-label model now behaves well with the default optimization tolerances, as do 28 pixels in a 3-label model.

I also updated the code to allow for the initial theta to be provided, and that initial theta is passed on between successive steps of Lambda. And I've done a bunch of tests (in a 3-label model and a 17-label model) since fixing the bug, and BFGS works well until we hit high Lambda, at which point BFGS returns a warning. I've implemented the best-of-both: BFGS will run first, and if BFGS exits with a warning then Powell's method is run instead.
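
For reference, a minimal sketch of that best-of-both fallback using scipy; the wrapper below sketches the logic only, not the actual training code:

from scipy import optimize

def optimize_with_fallback(objective, x0, args=()):
    # Try BFGS first; if it exits unsuccessfully (e.g., the precision-loss
    # warning seen at high Lambda), re-run with Powell's method instead.
    result = optimize.minimize(objective, x0, args=args, method="BFGS")
    if not result.success:
        result = optimize.minimize(objective, x0, args=args, method="Powell")
    return result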

Implement wavelength censoring

Should this be on vectorizer terms or label names? If it were on label names (e.g., [Al/H]), then any cross-terms with [Al/H] would also be zeroed. Which conceptual filter were you talking about, @davidwhogg: label names, terms, or both?
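
A minimal sketch of censoring by label name, representing each vectorizer term as a tuple of the label names it involves; this term representation and the mask layout are illustrative assumptions, not the library's API:

import numpy as np

# Illustration only: () is the constant term, ("TEFF",) is linear in TEFF,
# ("TEFF", "AL_H") is a cross-term, ("AL_H", "AL_H") is quadratic, and so on.
terms = [(), ("TEFF",), ("LOGG",), ("AL_H",), ("TEFF", "AL_H"), ("AL_H", "AL_H")]

def censor_by_label(terms, label, censored_pixels, n_pixels):
    # Return a (n_pixels, n_terms) boolean mask that is False wherever theta
    # should be forced to zero: every term involving `label`, including
    # cross-terms, at the censored pixels.
    mask = np.ones((n_pixels, len(terms)), dtype=bool)
    for j, term in enumerate(terms):
        if label in term:
            mask[censored_pixels, j] = False
    return mask

mask = censor_by_label(terms, "AL_H", np.arange(100, 200), n_pixels=8575)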

make a basic The Cannon landing page

This should have

  • license rider that says to cite Ness et al. and Casey et al. (in prep)
  • list of projects that we want to do, with a few sentences about each
  • list of example projects that have used TheCannon

@mkness and @davidwhogg can help write the second two items.
