
Comments (5)

lgbouma commented on August 24, 2024

PS: thank you for writing this very useful code!


hippke commented on August 24, 2024

Hi Luke, sorry you're having trouble.
I don't know what's causing this, but I suspect that the signal is overshadowed by outliers.
When I clean the data like this:

from astropy.stats import sigma_clip
flux = sigma_clip(flux, sigma_upper=3, sigma_lower=3)

Then I get a strong signal with a period of 7.20195 d.
The phase-fold looks like a transit:
[Figure 1: phase-folded light curve showing a transit-like dip]

This doesn't solve the question, but may be a work-around. Questions to better understand the issue:

  • Can you try more aggressive filtering of your data?
  • For how many light curves does this come up?


lgbouma commented on August 24, 2024

Hi Michael,

Thank you for this response! Responding to your questions:

Can you try more aggressive filtering of your data?
Sure, the filtering can be arbitrarily aggressive... up to the usual point of not wanting to overfit transits!

For how many light curves does this come up?
I've tried running TLS on 4 Kepler light curves so far, and I got this behavior for 2 of them -- it worked for Kepler-1627 and Kepler-1643, and failed for KOI-7368 and KOI-7913. The latter two have the weakest transit signals.

Possible solution

I was intrigued though that sigma clipping helped mitigate the issue. The power spectrum I got after sigma clipping as you suggested looked like the following (I've even added axis labels... where "power" means SDE!)

[Figure: tls_mwe_sigclip_run1_dy_default — SDE vs. period after sigma clipping, with the default dy]

so indeed, the (injected) 7.202 day signal was recovered. As a quick check, I went ahead and tried masking the 7.202 day signal, to see whether the (real) 6.843 day signal would also be recoverable. I again found, though, that TLS yielded a fixed chi^2 for all trial periods -- it failed in the same way. This was surprising, since in the BLS spectra from my initial message the 6.843 day signal was the dominant one!

I thought through this a bit more, and eventually formulated the idea that the issue must be tied to the definition of chi^2, and the comparison being done in core.lowest_residuals_in_this_duration between the transit model and a straight line
(permalink). chi^2 only has so many components, so I thought: what if the issue is with the definition of the uncertainties? TLS by default assumes dy = np.std(y). Perhaps this isn't the correct definition for these processed Kepler light curves (the subtraction of stellar rotation signals can yield large residuals which drive up the standard deviation).
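To see why an inflated dy can wash out a detection, here is a toy sketch (this is not TLS internals; the transit depth, outlier amplitudes, and noise level are invented for illustration). The chi^2 improvement of a box model over a flat line scales as 1/dy^2, so a dy inflated by outliers shrinks the contrast between trial periods:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = np.ones(n) + rng.normal(0, 0.001, n)   # flat light curve, 1 mmag noise
y[303:313] -= 0.005                        # a shallow box "transit"
y[::50] += rng.uniform(0.02, 0.05, 20)     # sparse outliers that inflate std(y)

# Box-transit model matching the injected signal
model = np.ones(n)
model[303:313] -= 0.005

def delta_chi2(dy):
    """chi^2 improvement of the transit model over a flat line."""
    chi2_line = np.sum(((y - 1.0) / dy) ** 2)
    chi2_model = np.sum(((y - model) / dy) ** 2)
    return chi2_line - chi2_model

# dy = std(y) is inflated by the outliers -> small improvement;
# dy near the true per-point scatter -> large improvement
print(delta_chi2(np.std(y)))
print(delta_chi2(0.001))
```

The improvement with a realistic per-point dy is over an order of magnitude larger than with the outlier-inflated std, which is consistent with the flat chi^2 spectra described above.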

This turned out to be a good idea, I think. If I set the uncertainties as say the mean point-to-point RMS across the light curve, I get an SDE spectrum that makes more sense -- a) there's an actual noise floor, and b) both the 7.202 and 6.843 day signals are there. The same procedure actually then successfully finds both signals:

[Figure: tls_mwe_sigclip_run1_dy_p2p — SDE vs. period with dy set to the point-to-point RMS; both signals visible above a noise floor]

I think this means this might be "user error" more than a bug in TLS. However, I wonder whether assuming dy = np.std(y) is really the most robust default (validate.py, L40). I'll leave this issue open so that you can consider whether to change this assumption in a future patch, but feel free to consider this particular case closed. Thank you for your help!

-Luke

PS: here's the updated MWE that finds both the real planet in the KOI-7368 light curve from above, as well as the injected one. The key fix was passing an explicit set of dy values.

import pandas as pd, numpy as np, matplotlib.pyplot as plt
from transitleastsquares import transitleastsquares
import multiprocessing as mp
from astropy.stats import sigma_clip
n_threads = mp.cpu_count()

def p2p_rms(flux):
    # Robust scatter estimate: 68th-percentile width of the point-to-point
    # flux differences (insensitive to outliers and slow trends).
    dflux = np.diff(flux)
    med_dflux = np.nanmedian(dflux)
    up_p2p = (
        np.nanpercentile(dflux - med_dflux, 84)
        - np.nanpercentile(dflux - med_dflux, 50)
    )
    lo_p2p = (
        np.nanpercentile(dflux - med_dflux, 50)
        - np.nanpercentile(dflux - med_dflux, 16)
    )
    p2p = np.mean([up_p2p, lo_p2p])
    return p2p

csvpath = 'KOI-7368_npoints50000.csv'
df = pd.read_csv(csvpath)
time, flux = np.array(df.time), np.array(df.flux)

# NOTE: not actually necessary for TLS to converge, but a good idea
# regardless because it lowers the noise floor in the periodogram
flux = sigma_clip(flux, sigma_upper=3, sigma_lower=3)
stime = time[~flux.mask]
sflux = flux[~flux.mask]

dy = np.ones_like(stime)*p2p_rms(sflux)
model = transitleastsquares(stime, sflux, dy=dy, verbose=True)
results = model.power(
    use_threads=n_threads, show_progress_bar=True, R_star=0.88,
    R_star_min=0.85, R_star_max=0.91, M_star=0.88, M_star_min=0.85,
    M_star_max=0.91, period_min=5, period_max=10, n_transits_min=10,
    transit_template='default', transit_depth_min=10e-6, oversampling_factor=5
)

plt.close("all")
plt.plot(results['periods'], results['power'], lw=0.5)
plt.xlabel('period [days]')
plt.ylabel('power')
plt.savefig('tls_mwe_sigclip_run1.png', dpi=400)

from transitleastsquares import transit_mask, cleaned_array

intransit = transit_mask(stime, results.period, 2*results.duration, results.T0)

y_second_run = sflux[~intransit]
t_second_run = stime[~intransit]
t_second_run, y_second_run = cleaned_array(t_second_run, y_second_run)

dy = np.ones_like(t_second_run)*p2p_rms(y_second_run)
model2 = transitleastsquares(t_second_run, y_second_run, dy=dy, verbose=True)
results2 = model2.power(
    use_threads=n_threads, show_progress_bar=True, R_star=0.88,
    R_star_min=0.85, R_star_max=0.91, M_star=0.88, M_star_min=0.85,
    M_star_max=0.91, period_min=5, period_max=10, n_transits_min=10,
    transit_template='default', transit_depth_min=10e-6, oversampling_factor=5
)

print(results2['period'])

plt.close("all")
plt.plot(results2['periods'], results2['power'], lw=0.5)
plt.xlabel('period [days]')
plt.ylabel('power')
plt.savefig('tls_mwe_sigclip_run2.png', dpi=400)


hippke commented on August 24, 2024

Hey, that's a very cool find. I totally agree that the issue is non-Gaussian errors. We could add a feature to TLS to calculate the uncertainty in a more sophisticated way, and move away from the standard deviation when the errors are non-Gaussian. What is the method p2p_rms in your script? What would you recommend for determining non-Gaussian uncertainty?


lgbouma commented on August 24, 2024

I can imagine two alternatives (though I'm sure there are more!).

The simpler option is to assume that the non-Gaussianity is mostly driven by a few outliers. If so, the standard deviation will be inflated by those outliers, but a statistic like the 84th-minus-50th percentile width (or its lower counterpart) will not be.

Another way to produce non-Gaussianity is if the user hasn't even attempted to remove a coherent signal that is present in the light curve (e.g., few-day stellar rotation, or scattered light for TESS at the edges of orbits). The point-to-point RMS statistic I quoted above is robust to these kinds of trends in the data too, because it works on the point-to-point difference time-series, which removes such trends.

Here's the output of an example code comparing these statistics against the standard deviation:

Seed 42
Gaussian errors (σ=0.05):
P2P RMS: 0.0704, STDEV: 0.0489, 68 percentile: 0.0481
Gaussian errors + 5% outliers:
P2P RMS: 0.0806, STDEV: 0.2176, 68 percentile: 0.0520
Gaussian errors + 5% outliers + line:
P2P RMS: 0.0806, STDEV: 0.2810, 68 percentile: 0.2156

Code that generates this is as follows.

import numpy as np

def sixtyeight(flux):
    """
    Calculate the 68th percentile of the distribution of the flux.
    """
    med_flux = np.nanmedian(flux)

    up_p2p = (
        np.nanpercentile( flux-med_flux, 84 )
        -
        np.nanpercentile( flux-med_flux, 50 )
    )
    lo_p2p = (
        np.nanpercentile( flux-med_flux, 50 )
        -
        np.nanpercentile( flux-med_flux, 16 )
    )

    p2p = np.mean([up_p2p, lo_p2p])

    return p2p

def p2p_rms(flux):
    """
    Calculate the 68th percentile of the distribution of the residuals from
    the median of the point-to-point differences δF_i = F_{i+1} - F_{i},
    where i is an index over time.
    """
    dflux = np.diff(flux)
    med_dflux = np.nanmedian(dflux)

    up_p2p = (
        np.nanpercentile( dflux-med_dflux, 84 )
        -
        np.nanpercentile( dflux-med_dflux, 50 )
    )
    lo_p2p = (
        np.nanpercentile( dflux-med_dflux, 50 )
        -
        np.nanpercentile( dflux-med_dflux, 16 )
    )

    p2p = np.mean([up_p2p, lo_p2p])

    return p2p

for seed in [42, 3141]:

    print(42*'=')
    print(f"Seed {seed}")

    np.random.seed(seed)
    n_points = 1000
    scale = 0.05

    err = np.random.normal(loc=0, scale=scale, size=n_points)
    y = np.ones(n_points) + err

    print(f"Gaussian errors (σ={scale}):")
    print(f"P2P RMS: {p2p_rms(y):.4f}, STDEV: {np.std(y):.4f}, 68 percentile: {sixtyeight(y):.4f}")

    outlier_ind = np.unique(np.random.randint(0, n_points, size=50))
    outlier_val = np.random.uniform(low=-0.1, high=0.1, size=len(outlier_ind))

    y[outlier_ind] = outlier_val

    print("Gaussian errors + 5% outliers:")
    print(f"P2P RMS: {p2p_rms(y):.4f}, STDEV: {np.std(y):.4f}, 68 percentile: {sixtyeight(y):.4f}")

    line = np.linspace(-0.3,0.3,n_points)
    y += line

    print("Gaussian errors + 5% outliers + line:")
    print(f"P2P RMS: {p2p_rms(y):.4f}, STDEV: {np.std(y):.4f}, 68 percentile: {sixtyeight(y):.4f}")

