
CluStR

This package calculates various scaling relations from cluster catalogs. While it is only appropriate for fitting power laws, it uses a methodology developed by Brandon Kelly for a fully Bayesian approach that incorporates correlated and heteroscedastic measurement errors, intrinsic scatter, selection effects, censored data, and Gaussian mixture modeling for the covariates.

For more information on the Kelly method, read his 2007 paper, check out his original IDL implementation linmix_err, or see the python port linmix by Josh Meyers, which was used in this work.

Getting Started (using conda env)

First, you will need conda.

Once you have that, clone this repository into a local directory, and switch to that directory:

git clone https://github.com/sweverett/CluStR
cd CluStR

Next, create a conda environment using the supplied environment.yml:

conda env create -f environment.yml

Now activate the newly-created conda environment:

conda activate clustr

Whenever you are finished, deactivate the conda environment with:

conda deactivate

Getting Started (using pip and conda)

First you will need to clone the repo by moving to the desired local directory and using

git clone https://github.com/sweverett/CluStR

Dependencies

Besides some standard packages like numpy, scipy, and matplotlib that can be acquired through common distributions or pip, CluStR requires that the following Python packages be installed:

  • astropy
  • corner
  • PyPDF2
  • linmix

Note that astropy is now included in Anaconda.

You can look at these packages for details, or simply paste the following into a terminal:

pip install astropy
pip install corner
pip install pypdf2

The simplest way to get linmix is to clone the repo and install using the given setup file:

git clone https://github.com/jmeyers314/linmix.git
cd linmix
python setup.py install

Now you should be ready to use CluStR!

Config File

Most parameters are set in the config.yml file. Here you can set the cosmology, default regression method, plotting options, and, most importantly, any desired flags. There are three possible flag types: bool, cutoff, and range. For each, you must specify the exact catalog column name you want to cut on, along with the flag type and, for a cutoff or range, the corresponding cut values. All name:value pairs must be separated by a colon.

There are two important things to note that might be unclear:

  • Setting a flag type and cut value does not mean the cut will be used! A flag is only applied if it is set in the actual method call - see Example Use below. This lets you keep many flag parameters in place without editing the config file every time you want a different combination of flags.

  • While it may seem counter-intuitive at first, the flag parameters specify the data you want to keep, not remove - i.e. set the redshift range you want your clusters to be found in rather than the ranges you want to remove. I found this eliminates some mental gymnastics when setting cuts, but it may feel awkward for bools.

Here is an example for each flag type:

Bool:

<column_name>_bool_type: <True/False>

To only include clusters that are not within r500 of a chip edge, use

edge_exclude_r500_bool_type: False

In other words - only use data that is not flagged with edge_exclude_r500.

Cutoff:

<column_name>_cut_type: <above/below>

<column_name>_cut: <value>

To analyze clusters whose redshift is below 1.0, use

redshift_cut_type: below
redshift_cut: 1.0

Range:

<column_name>_range_type: <inside/outside>

<column_name>_range_min: <value>

<column_name>_range_max: <value>

To only use clusters with redshift between 0.3 and 0.5, use

redshift_range_type: inside
redshift_range_min: 0.3
redshift_range_max: 0.5

This system may seem inefficient, but it allows quite a bit of flexibility in selecting interesting data subsets.
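To make the range example above concrete, here is a minimal sketch of how such a flag might translate into a boolean keep-mask (the helper name `range_mask` is ours, not from the CluStR codebase):

```python
import numpy as np

def range_mask(column, rtype, rmin, rmax):
    """Return a boolean keep-mask for an 'inside'/'outside' range flag."""
    inside = (column >= rmin) & (column <= rmax)
    return inside if rtype == 'inside' else ~inside

# Mirrors the redshift example: keep clusters with 0.3 <= z <= 0.5
redshift = np.array([0.1, 0.35, 0.45, 0.8])
keep = range_mask(redshift, 'inside', 0.3, 0.5)
# keep -> [False, True, True, False]
```

Flipping the type to 'outside' simply negates the mask, which is why the config only needs the type plus the two bounds.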

Example Use

python clustr.py <catalog.fits> <x> <y> <config.yml>

CluStR has four mandatory inputs: An appropriate cluster FITS file, the covariate variable (x-axis), the response variable (y-axis), and a configuration file. The available columns for axis variables (on the right) and their corresponding input labels (on the left) are:

  • lambda : lambda (richness)
  • l500kpc : 500_kiloparsecs_band_lumin
  • lr2500 : r2500_band_lumin
  • lr500 : r500_band_lumin
  • lr500cc : r500_core_cropped_band_lumin
  • t500kpc : 500_kiloparsecs_temperature
  • tr2500 : r2500_temperature
  • tr500 : r500_temperature
  • tr500cc : r500_core_cropped_temperature

To plot the scaling relation between r2500 temperature and richness, use

python clustr.py <catalog.fits> lambda tr2500 config.yml

The output file will be named <default_prefix>r2500_temperature-lambda.pdf, where you can set the default prefix in config.yml.
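A sketch of how that output name is likely assembled (the variable names here are ours; the real code may differ):

```python
# Output filename scheme: <default_prefix><y_column>-<x_column>.pdf
prefix = 'myrun_'              # default_prefix from config.yml (example value)
yname = 'r2500_temperature'    # catalog column behind the tr2500 label
xname = 'lambda'
filename = f'{prefix}{yname}-{xname}.pdf'
# filename -> 'myrun_r2500_temperature-lambda.pdf'
```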

Additionally, there is an optional filename-prefix argument (-p). As described in the Config File section, flag parameters are set in config.yml but are only used if set to True.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.


Issues

Check Dependencies

CluStR should automatically check for all dependencies and ask the user if they would like them installed through conda or pip, or for the user to do it themselves. Alternatively, the packages could always be included in the directory if small.
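A minimal sketch of such a check using only the standard library: look up each required package without importing it, then report what is missing (the offer to install via conda/pip would follow).

```python
import importlib.util

# Probe for each required package without actually importing it.
required = ['astropy', 'corner', 'linmix']
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    print('Missing packages:', ', '.join(missing))
    # here the script could ask whether to install via conda or pip
```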

run_options

I am getting this error for run_options. I'm not sure what run_options was supposed to refer to - do you know what I should put in main for this? Thanks!

Traceback (most recent call last):
  File "clustr.py", line 256, in <module>
    main()
  File "clustr.py", line 239, in main
    config = Config(config_filename) #(2)
TypeError: __init__() missing 1 required positional argument: 'run_options'

Plotlib.py Rework

Compared to the master branch, we need a plotlib.py that works with clustr.py. Currently:

  • plotlib.py includes global variables
  • plotlib.py has no classes
  • clustr.py in the rewrite does not
    • import plotlib
    • make a plot using plotlib

Don't hard-code flags

There's no reason to hard-code the list of possible flags--the parameter configuration file already tells us everything we need to know about new flags at runtime, so hard-coding the flags just makes the library less extensible.
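A sketch of the idea (the helper name and suffix-to-kind mapping are ours, built from the suffixes described in the Config File section):

```python
def discover_flags(config):
    """Map each flagged column name to its flag kind by key suffix,
    so no hard-coded list of flags is needed."""
    suffixes = {'_bool_type': 'bool', '_cut_type': 'cutoff', '_range_type': 'range'}
    found = {}
    for key in config:
        for suffix, kind in suffixes.items():
            if key.endswith(suffix):
                found[key[:-len(suffix)]] = kind
    return found

config = {
    'redshift_cut_type': 'below',
    'redshift_cut': 1.0,
    'edge_exclude_r500_bool_type': False,
    'plot_scatter': True,   # a plotting option, not a flag; ignored
}
# discover_flags(config) -> {'redshift': 'cutoff', 'edge_exclude_r500': 'bool'}
```

Any new flag added to config.yml would then be picked up automatically at runtime.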

CluStR rewrite!

Now that CluStR is being used by the group again, it's time for some updates! Or rather, let's use all the python/astro knowledge learned over the past 3 years and write a new, simplified, more extensible, and more general code base that wraps Kelly's linmix package for regression on arbitrary columns in FITS catalogs. We'll still need to implement some cluster-specific features, but this can be accomplished with a pre-existing config structure such as yaml.

This will also serve as a summer project for Paige (will link once she has an account!)

Here's a TODO list that we will update as needed:

  • Combine different cluster files into a single clustr.py (@paigemkelly @jjobel)
  • Setup new environment file, config file, and IO processing (@sweverett)
  • Make new main() function w/ object-oriented structure
  • Implement all new classes:
    • Config
    • Catalog
    • Data
    • Fitter
  • Implement new flag structure used by Config in Catalog or Data to apply cuts (@sweverett)

plotting

Should we build off the plotlib.py code you already wrote, or start fresh with the plotting? Jose and I have only done basic plots, so we were going to take a class or something to figure out how to do that.

Switch flag type to enum

Currently, we compare flag types to strings, which is somewhat error-prone. Once we implement #12, we should switch over to using Enums, which are cleaner and less error-prone.
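A sketch of the proposed change: an Enum makes typos fail loudly with a ValueError instead of silently failing a string comparison (member names here are ours, matching the three flag types in the Config File section).

```python
from enum import Enum

class FlagType(Enum):
    BOOL = 'bool'
    CUTOFF = 'cutoff'
    RANGE = 'range'

ftype = FlagType('cutoff')   # parse the config string once at load time
# ftype is FlagType.CUTOFF; FlagType('typo') would raise ValueError
```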

LRGS Chain Convergence

Not a priority at the moment - but it would be nice to implement automatic convergence tests for the lrgs method, as there have been convergence issues under certain failure modes. See #7 .

to do

  • put the 3 masks into one
  • move the list of allowed flags (boolean, range, snr) into the config file and access it from there
  • reformat the config file to yaml format
  • add at the end of each flag statement something that removes the NaNs
  • double check error handling with Arya's code
  • check for mistake that is making b, m, sigma different
  • make a True/False for boolean flags

Complete first-pass run of updated pipeline

It's hard to keep momentum up when trying to finish all the complicated parts all at once - so let's focus on getting the script to run with minimal functionality like loading a catalog and making a plot of the x vs. y we want to fit. We can try to pass it to the linmix fitter as well!

I'm using the milestone feature here; let's plan on closing this issue by then - a week after our MCMC meeting this Wednesday.

Complete Readme

The repository needs a complete Readme with:

  • General description of CluStR
  • User instructions
  • Example usage, possibly with Jupyter notebook
  • Possibly a parameter and makefile

fitter class

Should we use the old file for the fit, or write a new version directly into the Fitter class in clustr.py?

Residual Plots

It would be nice for one of the plot options to be a series of residual plots. May add options later, but for now will plot all options.

Make PEP 8 compatible

Let's bring this repository up to the general Python standards by making it PEP 8 compatible! This can be done by running pylint or flake8 on the repository and making the suggested changes.

Check Flagged Data Consistency

We should make sure that the flagged data removal methods are working as intended - would be instructive to make plots of flagged data as well as the data that survives the flagging.

getting rid of R

where is R being used in the code? is it only in reglib.py lines 9-16?

Rewrite branch needs to consolidate code

Right now the rewrite branch has the main code split into two copies - clustr.py and newnewclustr.py. That's no good! We need to consolidate into a single file that we all work off, namely clustr.py (though we can keep some of the new implementations from newnewclustr.py).

Taking a quick look at the two copies, I suggest that we keep the following from each file:

clustr.py:

  • Config
  • The additional argparse arguments in the ArgumentParser class, but move them into the correct format in newnewclustr.py
  • E(z) function (this is a cosmological function)
  • The get_data() function in the Catalog class (which we will co-opt into the new Data() constructor)

newnewclustr.py:

  • The new parser structure (at the top of the file)
  • main() function, as it's using the new OO design
  • Catalog, as we're restructuring it in the new design

Remember that we're not going to 'lose' anything by consolidating - the main code is still in the master branch! We can still reuse things from the main code if we find them helpful, but we're also not forced to use it.

MCMC Chain Input and Verbose

Preferably, the Kelly and Mantz method MCMC chain length should be an inputted parameter, with the Kelly method defaulting to say 5,000 and the Mantz method to 1,000 (NB: even at 1,000, the Mantz method is significantly slower). If possible, interface with linmix and lrgs to modify verbose levels.

Flags for Rewrite

I was thinking of making a function that does not depend on terminal arguments but instead relies on the param.config file, such that the user sets the flags they want applied to True in param.config. Since the param.config file contains many boolean values, the function would need a list of flag column labels to reference. If such a column label is set to True in the config file, then the function proceeds to remove all rows that contain values we don't want included in our analysis.

For example, suppose we have this data frame df:

| Name      | Merger |
|-----------|--------|
| catalog_1 | 0      |
| catalog_2 | 1      |
| catalog_3 | 0      |

0 means not a merger;
1 means a merger.

If the user sets merger=True in the param.config, then we want to remove all rows with df["merger"] == 1. We would then have a new dataframe df of only the good rows:

| Name      | Merger |
|-----------|--------|
| catalog_1 | 0      |
| catalog_3 | 0      |

I understand how to reference the config file and check if labels are set to True. I've uploaded a naive approach for doing that but I'm having a hard time figuring out how to cut the rows from the data. I've been referencing the master branch for inspiration but I get lost on how mask = np.zeros(len(data), dtype=bool) and mask |= cut is being used in both the create_cuts(data, flags) function and get_data(options) function.
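On the mask question: that idiom starts with an all-False removal mask and ORs in each flag's cut, so a row caught by any cut ends up removed. A minimal sketch using the example frame above (numpy stands in for the catalog column):

```python
import numpy as np

merger = np.array([0, 1, 0])                # the Merger column above
mask = np.zeros(len(merger), dtype=bool)    # no rows removed yet
mask |= (merger == 1)                       # OR in this flag's cut
# ...further `mask |= cut` lines accumulate the other flags' removals...
good = merger[~mask]                        # keep only rows no cut caught
# good -> [0, 0]
```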

Add richness value checks

There have been some issues with handling bad cluster data with negative richness values - should add a check for this in get_data(). Can think of other common covariate checks to add as well.
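A sketch of what such a check in get_data() might look like (`richness` stands in for the catalog's lambda column; the message wording is ours):

```python
import numpy as np

richness = np.array([45.0, -3.2, 60.1])
bad = richness <= 0                      # richness must be positive
if bad.any():
    print(f'Removing {bad.sum()} clusters with non-positive richness')
richness = richness[~bad]
# richness -> [45.0, 60.1]
```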

LRGS Failure Modes

Whenever there is time, it would be nice to come up with a notebook displaying some of the failure modes of lrgs that we have found, especially the case of large scatter compared to measurement error.

Flag Handling

CluStR should sort any inputted flag to be a boolean, range, or cutoff type automatically and remove the flagged data accordingly. Each flag type needs its own method:

  • Boolean
  • Range
  • Cutoff

Pickle directory!

Check to see if the /pickle directory is being automatically created if needed. Oops!
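A one-line fix sketch using the standard library (the directory path is taken from the issue title):

```python
import os

# Create the output directory if needed; a no-op if it already exists.
pickle_dir = 'pickle'
os.makedirs(pickle_dir, exist_ok=True)
```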

LRGS Gaussian Plots

There should be an optional plot that displays lrgs's best guess at the Gaussian mixtures, as well as reporting how many Gaussians were used for most plots.

x axis wrong?

In the plot that Lena made on the plane, the lambda values of at least a couple of clusters seem to be wrong. Most noticeable are two clusters with Tx>10 which appear in the plot with lambda ~ 35, but in the catalog actually have lambda ~ 55.
Noner2500_temperature-lambda.pdf

Python 3 Compatibility

We should probably convert everything to be Python 3 compatible, just to be forward looking :)

fit function of fitter class won't run

The fit function will not run in the Fitter class. When the main contains "fits = Fitter(viable_data, plot_filename)", the output is only "test1", so it runs the __init__ part and then stops.

class Fitter(object):
    def __init__(self, data, plotting_filename):
        self.viable_data = data
        self.plotting_filename = plotting_filename
        print('test1')

    def fit(self):
        print('test2')
        x_obs = self.viable_data[0]
        y_obs = self.viable_data[1]
        x_err = self.viable_data[2]
        y_err = self.viable_data[3]

        # run linmix
        print("Using Kelly Algorithm...")
        kelly_b, kelly_m, kelly_sig = reglib.run_linmix(x_obs, y_obs, x_err, y_err)
        print(kelly_b)

        # use before plotting
        log_x = np.log(x_obs)
        x_piv = np.median(log_x)
        log_y = np.log(y_obs)

        return [log_x - x_piv, log_y, x_err / x_obs, y_err / y_obs, x_piv]

Parameter File Options

To simplify things, the param.config file should set cutoff and range values for optional flags (see #2 ), what plots are to be saved to the output file, and default values for certain inputs such as the catalog and variables.

Censored Data

While linmix can handle censored data, this feature has not yet been implemented in CluStR. This will likely be done with a masking array from the inputted catalog.
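A sketch of what that masking array might look like: linmix's `delta` argument marks each point as detected (1) or censored (0). The `detected` column here is hypothetical, standing in for whatever detection indicator the catalog provides.

```python
import numpy as np

detected = np.array([True, True, False, True])   # hypothetical catalog column
delta = detected.astype(int)                     # linmix convention: 1=detected, 0=censored
# delta -> [1, 1, 0, 1]; would then be passed along the lines of
# linmix.LinMix(x, y, xsig, ysig, delta=delta)
```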

KeyError Raised

It seems that the key 'x_label' is not found in the config file when running the following code:

def get_data(self, config, catalog):
        xlabel = fits_label(config['x_label'])
        ylabel = fits_label(config['y_label'])

From my understanding, xlabel calls the fits_label function with config['x_label'] as the axis name as input, but I end up with the following error:

KeyError: 'x_label'

Which I found to mean that I'm trying to access a key that is not in the dictionary. I've also tried using the following,

def get_data(self, config, catalog):
        xlabel = fits_label(Config.__getitem__(x_label))
        ylabel = fits_label(Config.__getitem__(y_label))

I now get the following error,

NameError: name 'x_label' is not defined

Essentially, I'm trying to figure out how to obtain x and y for the Data class.
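For reference, the error means the dict loaded from config.yml simply has no 'x_label' key; the usual pattern is plain dict access with a defensive fallback (a sketch; the message wording is ours):

```python
# As if config came from yaml.safe_load(open('config.yml'))
config = {'y_label': 'lambda'}

xlabel = config.get('x_label')     # None instead of KeyError when missing
if xlabel is None:
    print("config.yml has no 'x_label' entry - add it to the config file")
```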

File Name Options

While there is a default naming scheme and optional prefix parameter, it would be nice to have a full file name parameter option in param.config. Could use the current scheme as the default if nothing is specified.

SNR flag

I added the SNR flag to cut data with SNR < 9; 840 clusters are being removed. I am trying to figure out why the RuntimeWarning is happening.

Removed 9 clusters due to bad_mode flag of <class 'bool'>
Removed 15 clusters due to overlap_r500 flag of <class 'bool'>
Removed 7 clusters due to overlap_r2500 flag of <class 'bool'>
Removed 41 clusters due to edge_r2500 flag of <class 'bool'>
Removed 27 clusters due to overlap_bkgd flag of <class 'bool'>
Removed 52 clusters due to edge_bkgd flag of <class 'bool'>
Removed 39 clusters due to masked flag of <class 'bool'>
/home/paige/anaconda3/lib/python3.7/site-packages/astropy/table/column.py:984: RuntimeWarning: invalid value encountered in less
result = getattr(super(), op)(other)

Removed 840 clusters due to 500_kiloparsecs_SNR flag of <class 'str'>
Removed 0 clusters due to Redshift flag of <class 'str'>

NOTE: Removed counts may be redundant, as some data fail multiple flags.
Accepted 191 data out of 1092

mean x error: 8.974993467153285e+42
mean y error: 4.850821240875912
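That RuntimeWarning typically comes from comparing NaN entries: `column < 9` is an invalid comparison wherever the SNR column holds NaN. Comparing only the finite entries silences it and removes the NaN rows explicitly (a sketch with made-up values):

```python
import numpy as np

snr = np.array([12.0, np.nan, 4.0, 20.0])
keep = np.zeros(len(snr), dtype=bool)
valid = ~np.isnan(snr)
keep[valid] = snr[valid] >= 9    # NaN rows stay False, i.e. removed
# keep -> [True, False, False, True]
```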

Cosmology Check

While the cosmological parameters are now possible inputs, it is not explicitly clear to the user that a cosmology is being assumed. Perhaps should make a message asking if the assumed cosmology is acceptable - although this should be skippable for scripting purposes.
