andycasey / annieslasso
The Cannon 2: Compressed sensing edition
License: MIT License
(so that they can be loaded from elsewhere without having the huge data files)
Most of what he needs will be satisfied by #36, but he also wants something that shows the non-zero parameters as a function of Lambda. Also something that shows 2-d projections of the 17-d space for the training set, and for a much larger set of stars (if we can!)
There are two reasons I've encountered so far that make me think we should not:
The CannonModel class uses pickle to save and load objects. Python 3 has a newer pickling protocol for saving Python objects, which is incompatible with previous versions. For that reason, any model saved in Python 3 using the highest available protocol (which is the current default in our code base, for many reasons) will not be loadable in Python 2.x: someone would have to load it in Python 3, then save it again using a lower protocol. The newer protocol has faster read/write times and much smaller file sizes. @davidwhogg, @mkness: comments welcome!
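A minimal illustration of the compatibility trade-off described above (the model contents here are just a stand-in dict, not the actual CannonModel state):

```python
import pickle

# Stand-in for a trained model's state.
model_state = {"coefficients": [1.0, 2.0], "scatter": 0.01}

# Saved with the highest protocol: Python-3-only, but smaller and faster.
py3_only = pickle.dumps(model_state, protocol=pickle.HIGHEST_PROTOCOL)

# Saved with protocol 2: the highest protocol Python 2.x can read.
py2_compatible = pickle.dumps(model_state, protocol=2)

# Both round-trip correctly in Python 3.
assert pickle.loads(py2_compatible) == model_state
assert pickle.loads(py3_only) == model_state
```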
...Hogg supply details here
In a very high-dimensional label space it may not even be enough to say 'high chi-squared value', if the model is sufficiently flexible.
Either we have to implement the @mkness hack or else do something better, but we need something here.
The idea is to fit the labels and a few more parameters. Hogg has a trivial implementation idea.
This is assigned to @mkness; ask @andycasey for instructions.
For example, can we craft the right question to inform us whether it is worthwhile including Teff^4, or say log(Teff), etc.? Regularization can help, but these questions need to be thought out properly so we don't get misinformed by cross-terms.
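One reason cross-terms can mislead here: over a realistic Teff range, candidate parameterizations like Teff, Teff^4, and log(Teff) are strongly correlated with each other, so regularization can shuffle weight between them almost freely. A toy illustration (the Teff range is an assumption):

```python
import numpy as np

# Assumed plausible stellar Teff range [K].
teff = np.linspace(3500.0, 7500.0, 100)

# Candidate vectorizer terms built from the same underlying label.
candidates = np.vstack([teff, teff**4, np.log10(teff)])

# Correlation matrix of the candidate terms: all entries are close to 1,
# i.e., the terms are highly degenerate with one another.
corr = np.corrcoef(candidates)
print(corr.round(3))
```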
If the vectorizer is scaled, do we have good reason to specify boundary conditions on the theta parameters when simultaneously optimizing them with the scatter term? (None of this will matter if analytic derivatives are included, but it is worth pondering.)
Here is an example: 2M06123730+4036001
...but there are many more. What is going on? How can ASPCAP give us "good" labels for a star in which almost no pixel is good in any sub-visit: literally not one good measurement at any wavelength?
Here's the test I want:
1. Loosen pgtol by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten pgtol.
2. Reset pgtol to normal and loosen factr by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten factr.
This PR can be closed when both of these tests pass.
Bonus points: you can also loosen these parameters; we really want them as loose as possible such that the above tests pass.
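The tolerance-sensitivity test above can be sketched with scipy's L-BFGS-B wrapper, which exposes factr and pgtol directly. The toy quadratic objective is a stand-in, not the actual Cannon training objective:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

TARGET = np.array([1.0, -2.0, 0.5])

def objective(theta):
    # Stand-in convex objective; the real test would use the Cannon
    # per-pixel training objective.
    return np.sum((theta - TARGET)**2)

def gradient(theta):
    return 2.0 * (theta - TARGET)

x0 = np.zeros(3)

# Baseline fit with the default tolerances.
theta_ref, _, _ = fmin_l_bfgs_b(objective, x0, fprime=gradient,
                                factr=1e7, pgtol=1e-5)

# Loosen pgtol by a factor of 10 and re-fit.
theta_loose, _, _ = fmin_l_bfgs_b(objective, x0, fprime=gradient,
                                  factr=1e7, pgtol=1e-4)

# The test passes if no parameter moves by more than 1e-7 (absolute).
passes = np.all(np.abs(theta_ref - theta_loose) < 1e-7)
print(passes)
```

The same pattern, with pgtol reset and factr loosened by a factor of 10, gives the second half of the test.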
@mkness, what exactly do we do to set these? Unfortunately this is urgent! We are working with spectra from apStar files.
We will fit continuum with sums of sines and cosines, not polynomials, for very important reasons!
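A minimal sketch of continuum fitting with sums of sines and cosines (a low-order Fourier series). The design-matrix construction, length scale, and order are illustrative assumptions, not the repository's actual settings:

```python
import numpy as np

def continuum_design_matrix(wavelengths, L, order):
    # Columns: 1, then cos(k*x) and sin(k*x) for k = 1..order,
    # where x = 2*pi*wavelength / L.
    scaled = 2.0 * np.pi * wavelengths / L
    columns = [np.ones_like(scaled)]
    for k in range(1, order + 1):
        columns.append(np.cos(k * scaled))
        columns.append(np.sin(k * scaled))
    return np.vstack(columns).T

# Toy APOGEE-like wavelength range [Angstroms] and a smooth "continuum".
wavelengths = np.linspace(15100.0, 16900.0, 500)
flux = 1.0 + 0.01 * np.sin(2.0 * np.pi * wavelengths / 1800.0)

A = continuum_design_matrix(wavelengths, L=1800.0, order=3)
coeffs, *_ = np.linalg.lstsq(A, flux, rcond=None)
continuum = A @ coeffs
```

Because the sine/cosine basis is periodic and globally smooth, it cannot chase narrow absorption lines the way a high-order polynomial can, which is part of the motivation.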
This should have
@mkness and @davidwhogg can help write the second two items.
Should this be on vectorizer terms or label names? If it were on label names (e.g., [Al/H]), then any cross-terms with [Al/H] would also be zero. Which conceptual filter were you talking about, @davidwhogg: label names, terms, or both?
This is so I don't forget.
We need to produce not just labels but also the best-fit prediction for every star. This increases disk space but is a very valuable data product.
Consider using ridge regression for regularization at the test step:

\argmin_{\ell} \left[ \chi^2(\ell) + \lambda \, (v(\ell) \cdot w)^2 \right]
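A runnable toy version of that objective, with the symbols as above; the identity vectorizer, uniform weights, and quadratic chi-squared are stand-in assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_test_objective(ell, chi2, v, w, lam):
    # chi^2(ell) + lambda * (v(ell) . w)^2
    return chi2(ell) + lam * np.dot(v(ell), w)**2

# Toy setup: 2 labels, identity vectorizer, uniform weights.
chi2 = lambda ell: float(np.sum((ell - np.array([1.0, 2.0]))**2))
v = lambda ell: ell
w = np.ones(2)

unregularized = minimize(
    lambda e: regularized_test_objective(e, chi2, v, w, 0.0),
    np.zeros(2), method="Nelder-Mead").x
shrunk = minimize(
    lambda e: regularized_test_objective(e, chi2, v, w, 1.0),
    np.zeros(2), method="Nelder-Mead").x

# The ridge penalty shrinks v(ell) . w toward zero.
assert abs(np.dot(v(shrunk), w)) < abs(np.dot(v(unregularized), w))
```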
If a CannonModel (or sub-class) is instantiated with a multiprocessing pool, there are situations where the code will hang when trying to fit spectra. This situation is irrecoverable; the program will never finish. However, it will work perfectly fine in serial.
This situation arises because of Accelerate on OS X, and it affects all Python 2 and 3 versions. Specifically, np.dot does not work in parallel under certain conditions (e.g., here).
In Python 2 the workaround is to compile numpy against a different BLAS (see numpy/numpy#4776).
In Python 3 you can also resolve this situation by just using multiprocessing in 'spawn' mode. The way to do this (before importing The Cannon) is:
import multiprocessing as mp
mp.set_start_method('spawn')
Since this problem is difficult to diagnose, I've opened this issue in case someone comes searching for these symptoms. However, this issue can't be fixed by the numpy group or by us, so I'm immediately closing it. (It also encourages me to drop Python 2.7 support and enforce the 'spawn' method for Python 3.)
For presenting results.
@andycasey write a couple of paragraphs for our introduction about the things you have learned from Gaia-ESO that suggest the value of TheCannon.
I don't like the @mkness list of continuum pixels. This is outside the scope of paper 1, so this is an enhancement.
The following image suggests that the optimization is not converging during the training phase for regularization models with high Lambda values.
The problem appears to be unrelated to ftol or xtol. The above figure was made using the default ftol and xtol. If the tolerance is decreased (to 3e-5), the same behaviour remains for a single line:
I think this bug is fixed now: Here's the same line in a 17-label model using default optimization tolerances:
And here are 28 pixels in a 3-label model:
I also updated the code to allow the initial theta to be provided, and that initial theta is passed on between successive steps of Lambda. I've done a bunch of tests (in a 3-label model and a 17-label model) since fixing the bug, and BFGS works well until we hit high Lambda, at which point BFGS returns a warning. I've implemented the best of both: BFGS runs first, and if BFGS exits with a warning then Powell's method is run instead.
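The BFGS-with-Powell-fallback strategy can be sketched as follows; the function names and the toy objective are illustrative assumptions, not the repository's actual code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_pixel(objective, initial_theta):
    # Try BFGS first: fast when the objective is well-behaved (low Lambda).
    result = minimize(objective, initial_theta, method="BFGS")
    if not result.success:
        # BFGS exited with a warning (e.g., precision loss at high Lambda);
        # fall back to the derivative-free Powell method, warm-started
        # from wherever BFGS stopped.
        result = minimize(objective, result.x, method="Powell")
    return result.x

# Toy convex objective standing in for the per-pixel training objective.
theta_opt = fit_pixel(lambda t: np.sum((t - 1.0)**2), np.zeros(3))
```

Warm-starting Powell from the BFGS iterate (rather than from the original initialization) keeps whatever progress BFGS made before it hit numerical trouble.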
@andycasey, how did you switch things so that it went from your repo being a fork of mine to mine being a fork of yours? Was that a "setting" change in both repos? Or what? It seems like it must be a GitHub change (not a git change). Anyway, both I and @dfm want to know.
It would be ideal if we could make use of GitHub Pages for our landing page at TheCannon.io. That way all the page content would be stored in the gh-pages branch of this repository.
Currently TheCannon.io domain just points to the Read the Docs page.
Only one custom domain can be used with GitHub Pages per user or organization. I already have a GitHub Pages site set up for my personal website, which means the GH Pages URL redirects from https://andycasey.github.io/AnniesLasso/index.html to astrowizici.st/AnniesLasso/index.html.
That means we need to create an organization (or username) and then specify TheCannon.io as our custom domain.
So, what organization name should we use, @davidwhogg? Once selected, we will have to move this repository to that new organization.
Right now we are not sure how to edit the inverse variance, given the pixel masks.
We should offset using a label offset to a fiducial label value (as we do).
But we should scale the K + K * (K + 1) / 2 vectorizer-output terms independently.
Ask @davidwhogg for details.
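A quick check of that term count: a quadratic vectorizer with K labels has K linear terms plus K * (K + 1) / 2 quadratic terms (squares and cross-terms), which is what combinations-with-replacement counts:

```python
from itertools import combinations_with_replacement

K = 17  # e.g., the 17-label model

linear_terms = K
# Quadratic terms: unordered pairs of labels with repetition
# (squares like Teff^2 and cross-terms like Teff*logg).
quadratic_terms = len(list(combinations_with_replacement(range(K), 2)))

assert quadratic_terms == K * (K + 1) // 2
total_terms = linear_terms + quadratic_terms
print(total_terms)  # 17 + 153 = 170
```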
We have everything we need to do this. This is our major stumbling block for paper 0.
Sanders informs me that there are stars with the same APOGEE_ID but different ASPCAP_ID values. We stacked and analysed everything based on ASPCAP_ID. That means we can identify duplicates of APOGEE_ID values and also use them to estimate internal precision (and covariances, as Sanders is doing for distance determination).
This is trivial to do and would tell us about the (scale of the) uncertainty in the labels that results from having a finite sample size.
This could be its own project if one wanted to look into the uncertainties of the training set labels, etc.
Ideally, RA, Dec, and our 17 labels for all APOGEE stars, for some setting of f and Lambda.
This is high priority!
Right now we do a single optimization at test time. This can't be right, since the optimization is non-convex. Also, Anna Ho finds that for LAMOST there are wrong answers caused by local minima. We should define a set of K initializations and start from all K, and choose the best answer (best in a likelihood sense). This, unfortunately, is high priority...
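A sketch of that multi-start test step: run the non-convex optimization from K initializations and keep the best result in a likelihood (lowest chi-squared) sense. The function names and the toy objective are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_labels(chi2, initializations):
    # Optimize from every initialization and keep the best answer
    # (lowest chi-squared, i.e., highest likelihood).
    results = [minimize(chi2, x0, method="Nelder-Mead")
               for x0 in initializations]
    best = min(results, key=lambda r: r.fun)
    return best.x

# Toy non-convex objective with a local minimum near x = -0.9 and the
# global minimum near x = +1.1.
chi2 = lambda x: float((x[0]**2 - 1.0)**2 + 0.5 * (x[0] - 2.0)**2)

# A single start at -1.0 would get trapped; three starts recover the
# global minimum.
labels = fit_labels(chi2, [np.array([-1.0]), np.array([0.5]),
                           np.array([2.0])])
```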
This is a nomenclature question for @davidwhogg: when trained, a CannonModel currently has attributes model.coefficients and model.scatter. I'm refactoring the code to calculate the inverse variances and use them in the right way, and in doing so I actually solve for s^2. Should model.coefficients and model.scatter be renamed, and if so, to what? model.theta and model.pixel_variance? Something else?
...not the spectra normalized by ASPCAP.
We need to have the code duplicated within SDSS for it to become an official part of DR14. Timescale for completion: mid-2016.
Also, it would be cool to make this very easy to maintain and update.
Re-normalize continuum for all stars (training and test), according to the Ness-ish method.
If you log in to Gandi, you should see that you are the admin for thecannon.io. Do your magic.
This is critical.
DUDE. If we are going to pick one overall Lambda, appropriate for all pixels, and use the same Lambda everywhere, we should use the Lambda that minimizes the SUM of the one-d plots we are making for all the pixels (that we can afford to test).
We need to consistently use unambiguously different names for the components of theta, the stellar parameters, the stellar abundances, and labels. And be consistent.
For @andycasey; see http://gandi.net/.
That unit test should be able to test L1, chi-squared, and the whole objective function.
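A hedged sketch of such a unit test, checking the L1 term, the chi-squared term, and the whole objective (chi-squared + Lambda * L1) on a tiny hand-computable case. The function names are illustrative assumptions:

```python
import numpy as np

def l1_norm(theta):
    return np.sum(np.abs(theta))

def chi_squared(theta, design_matrix, flux, ivar):
    residual = flux - design_matrix @ theta
    return np.sum(residual**2 * ivar)

def objective(theta, design_matrix, flux, ivar, Lambda):
    return chi_squared(theta, design_matrix, flux, ivar) + Lambda * l1_norm(theta)

# Hand-computable case: identity design matrix, two pixels.
theta = np.array([1.0, -2.0])
A = np.eye(2)
flux = np.array([2.0, -2.0])
ivar = np.array([1.0, 4.0])

assert l1_norm(theta) == 3.0                      # |1| + |-2|
assert chi_squared(theta, A, flux, ivar) == 1.0   # (2-1)^2*1 + 0^2*4
assert objective(theta, A, flux, ivar, Lambda=10.0) == 31.0  # 1 + 10*3
```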