baseband-tasks's Issues

EOFError while reading using baseband

I am having trouble reading data with scintillometry and baseband. I am not sure whether the error is in one or the other, but I suspect it is the former.
Note that with previous versions of both packages, the same dataset read fine.

My software basically wraps the functions for reading and changing shape (ChangeSampleShape), these are:

def telescope_shape(fh, telescope):
    """
    Reshape data for ARO or CHIME with a file handler. This is the standard way
    to transform data to V[time, frequency, polarization]. ``A = CHIME`` and
    ``B = ARO``.

    Parameters
    ----------
    fh : `baseband.vdif.base.VDIFStreamReader`
        File-handler from `baseband`.
    telescope : `str`:
        Telescope name: ``'CHIME'`` or ``'ARO'``.

    Returns
    -------
    fh_reshaped : `baseband.vdif.base.VDIFStreamReader`
        File-handler with the new shape.
    """

    def reshape_chime(data):
        return data.reshape(-1, 256, 4, 2).transpose(
            0, 2, 1, 3).reshape(-1, 1024, 2)

    def reshape_aro(data):
        return data.swapaxes(1, 2)

    if telescope == 'CHIME' or telescope == 'A':
        fh_reshaped = ChangeSampleShape(fh, reshape_chime)

    elif telescope == 'ARO' or telescope == 'B':
        fh_reshaped = ChangeSampleShape(fh, reshape_aro)

    else:
        raise ValueError("telescope must be 'CHIME'/'A' or 'ARO'/'B'")

    return fh_reshaped

and

def read_ntime(
    data_sorted, telescope, timestamp_i, ntime, freq_resolution=390625 * u.Hz,
    verify=True
        ):

    """
    Given an initial timestamp, ``timestamp_i``, read the ``ntime``
    samples that follow.

    Parameters
    ----------
    data_sorted : `list`
        List of strings with the corresponding paths of files to be opened.
        Files must be sorted chronologically.
    telescope : `str`:
        Telescope name: ``'CHIME'`` or ``'ARO'``.
    timestamp_i : `astropy.time.core.Time`
        Initial timestamp from where to start reading.
    ntime : `int`
        Number of time samples of `2.56e-6` s to be opened.
    freq_resolution : `astropy.units.Quantity`
        Frequency resolution, sample rate or frequency sampling in
        :math:`\\mathrm{Hz}`.
    verify : `bool`
        Whether to do consistency checks on frames being read. See the
        `baseband` documentation. Default is `True`.

    Returns
    -------
    V : `np.ndarray`
        Three-dimensional array, shaped as ``V[ntime, frequency, polarization]``.
    limits: `list`
        List containing the `astropy.time.core.Time` of start and stop of
        `data_sorted`.
    """

    fraw = helpers.sequentialfile.open(data_sorted, 'rb')
    fh = vdif.open(fraw, mode='rs', sample_rate=freq_resolution, verify=verify)
    fh = telescope_shape(fh, telescope=telescope)

    fh.seek(timestamp_i)

    # updating timestamp with baseband
    timestamp_i_corrected = fh.tell(unit='time')

    V = fh.read(ntime)

    # clamp the final timestamp to the end of the stream
    timestamp_f = fh.tell(unit='time')
    if timestamp_f > fh.stop_time:
        timestamp_f = fh.stop_time

    fh.close()

    limits = [timestamp_i_corrected, timestamp_f]

    return V, limits

While running my GP search, I encountered the same error over and over until it had run through the entire data set.

Traceback (most recent call last):
  File "/home/v/vanderli/cassane/babendil-frb-vlbi/scripts/gp_run_search.py", line 184, in <module>
    gp_thres=gp_thres
  File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/sigprocess/gp_search.py", line 298, in run
    V_cdd, limits = self.process_file(timestamp, ntime)
  File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/sigprocess/gp_search.py", line 207, in process_file
    verify=False
  File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/io/reader.py", line 172, in read_ntime
    V = fh.read(ntime)
  File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/scintillometry-0.0.dev394-py3.7.egg/scintillometry/base.py", line 298, in read
    raise EOFError("cannot read from beyond end of input.")
EOFError: cannot read from beyond end of input.

Located here:

https://github.com/mhvk/scintillometry/blob/e8464747599574b0a61aa662f28289c9bbbcfc03/scintillometry/base.py#L296-L298

Some other comments:

  • I set verify=False
  • Tested random timestamps and it seems to give the same problem over and over

Import all tasks at the top level?

It seems to make little sense to have to import the tasks from specific modules, except perhaps the helpers and io. Just import them all at the top level?

StreamCombiner class

Would be useful both to produce input for, say, correlation, and to combine, e.g., multiple channels from a single source split over multiple files.

Create Derotation class (or add option to Dedisperse) for RM correction

Would be nice to be able to apply/correct rotation measure as one does for dispersion. From an e-mail by Dongzi (recipe slightly edited):
Here is the algorithm to correct the cable delay and RM for linear voltage data. It is straightforward to test: form the Stokes parameters with the corrected X and Y and see if all the effects are removed. With the same parameters, if you are able to calibrate the Stokes parameters, you are able to calibrate X and Y. To do it more rigorously, there should be an algorithm to calculate the Jones matrix given two calibration sources; it would also be straightforward to apply (probably ask Luke for this).

  1. correct cable delay: X=X*exp(-i tau f), where tau is the cable delay and f is the frequency
  2. create circular pols L,R=0.5*(X±iY) (definition of L and R might be opposite);
  3. correct RM: L=L*exp(- i RM lambda^2), R=R *exp(i RM lambda^2)
  4. reconstruct linear: X=L+R, Y=(L-R)i

so the combined Jones matrix for the above steps will be

J=[[exp(i tau f) cos (RM lambda^2), -sin (RM lambda^2)],
   [exp(-i tau f) sin(RM lambda^2), -cos(RM lambda^2)]]

and J dot [X,Y] would give you the corrected X and Y.

But I highly recommend breaking the steps apart and forming the Stokes parameters after each step to see the effect on the first burst. I might be missing some constant phase, or the convention for RM/cable delay may be off by a sign, etc. But I think you are very smart, so this will be easy for you to find out.
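As a concrete sketch of steps 1 to 4 above: the function name and all sign conventions below are illustrative only (exactly the kind of thing the email warns may be off), assuming complex voltage arrays X and Y, frequencies f in Hz, cable delay tau in s, and RM in rad/m^2:

```python
import numpy as np

def derotate(X, Y, f, tau, rm):
    """Correct cable delay and rotation measure for linear voltages.

    Hypothetical helper following the recipe above; signs and the
    L/R definition may need flipping, as the email notes.
    """
    c = 299792458.0                    # speed of light in m/s
    lam2 = (c / f) ** 2                # lambda^2 per frequency channel
    X = X * np.exp(-1j * tau * f)      # 1. correct cable delay on X
    L = 0.5 * (X + 1j * Y)             # 2. form circular polarizations
    R = 0.5 * (X - 1j * Y)
    L = L * np.exp(-1j * rm * lam2)    # 3. correct RM
    R = R * np.exp(1j * rm * lam2)
    return L + R, (L - R) / 1j         # 4. reconstruct linear X, Y
```

(Note `(L - R) / 1j` rather than the email's `(L - R) i`, so that the identity case, tau = 0 and RM = 0, round-trips; this is precisely the sign-convention ambiguity the email mentions.)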

cc @theXYZT

Decide how to do writing, also of intermediate steps

As raised by @luojing1211 per e-mail: Once one has set up a pipeline, how to actually get it to write out its results? And perhaps also write intermediate steps?

One possibility, at least for the final step, is to create a memory-mapped output and then use that for the out argument in read. E.g., currently, the following works:

from numpy.lib.format import open_memmap
...
stack = Stack(square, nbin, phase)
mm = open_memmap('stack.npy', 'w+', dtype='f4', shape=stack.shape)
stack.read(out=mm)

FITS files similarly can be opened with memmap=True, so this should work reasonably transparently. All we would need is then to have some functions that create the right output files, filling the headers.

See also a complete pipeline at https://gist.github.com/mhvk/65944fcc8477c7a40ef10a896bb90a56

Read task out shape does not match PSRFITS data shape

In PR #119 I added the data-writing functionality. The file handler shape is (nsample, nbin, npol, nchan), but the PSRFITS shape is (nsample, nbin, nchan, npol), so

read(nsample, out=PSRFITS)

raises a shape-mismatch error. Where should the reshape happen?
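One cheap place to do it would be a transposed view just before writing; a sketch with example shapes only:

```python
import numpy as np

# Task output order: (nsample, nbin, npol, nchan);
# PSRFITS order:     (nsample, nbin, nchan, npol).
out = np.zeros((16, 512, 2, 64))
psrfits_view = out.transpose(0, 1, 3, 2)  # swap last two axes; a view, no copy
```

Since transpose returns a view, the read could fill `out` while the writer consumes `psrfits_view` without an extra copy.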

Return data in a ndarray subclass with a count attribute?

For folding, we need to return not just the folded data, but also how many samples go in. For this purpose, we perhaps should return a subclass of ndarray which holds a count attribute. This could be extended to all tasks and even to baseband (mhvk/baseband#286) to take into account missing data or data removed by, e.g., RFI clipping.

It seems reasonable for this attribute not to propagate much, if at all, i.e., most operations would delete it.

Optimal reference frequency for dedispersion

@quantumfx pointed to a paper by Pennucci (https://iopscience.iop.org/article/10.1088/0004-637X/790/2/93), which includes a discussion of the optimal reference frequency for dedispersion, which minimizes covariance between DM and time shifts:

1 / f_ref**2 = sum(weight / f**2) / sum(weight)

Here, the sum is over the frequency bands involved. In principle, the weight includes the spectral shape of the pulsar and gains, etc., which may make this perhaps not so easy to use in practice, though even with constant weight, it should be better than just taking the mean frequency.
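With constant weights this reduces to a harmonic-mean-square of the band frequencies; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def optimal_ref_frequency(f, weight=None):
    """Reference frequency minimizing DM / time-shift covariance:
    1 / f_ref**2 = sum(weight / f**2) / sum(weight)."""
    f = np.asarray(f, dtype=float)
    w = np.ones_like(f) if weight is None else np.asarray(weight, dtype=float)
    return np.sqrt(w.sum() / (w / f**2).sum())
```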

Fold Should Take a Start Time Argument?

Would it be possible for the fold class to take a start time argument?

For instance, if my baseband stream starts at 12:00:12.5, I might like my first time bin to start at 12:00:00 or 12:00:10. This would allow one to align the time bins between different stations and datasets, which is important if, for instance, you want to measure a time offset between two stations.

Currently, I don't see any way to do this, as the Integrate class forces the start time to be the start time of the baseband stream.
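Picking such an aligned start could be sketched as follows (a hypothetical helper; times here are plain seconds past some reference epoch):

```python
import numpy as np

def aligned_start(stream_start_sec, step_sec):
    """Latest multiple of `step_sec` at or before the stream start,
    so bins from different stations fall on a common grid."""
    return np.floor(stream_start_sec / step_sec) * step_sec
```

E.g., a stream starting 12.5 s past the minute with 10 s bins would get its first bin edge at 10 s, i.e. 12:00:10 in the example above; Fold/Integrate would then need to accept this as an explicit start argument.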

Different dispersion delay constant

It seems the dispersion delay constant used in this package is different from the one that has worked in the past for similar work.

In dm.py you define it as:
dispersion_delay_constant = u.s / 2.41e-4 * u.MHz**2 * u.cm**3 / u.pc

where 1/2.41e-4 evaluates to the floating-point number 4149.377593360996

whereas in my scripts, it has been set to:
4149. * u.s * u.MHz**2 * u.cm**3 / u.pc

which is definitely different. How would we go about figuring out which value is correct? Is there a way to confirm?
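For scale, the two constants differ by roughly one part in 10^4, which can matter for precise timing; dropping the astropy units:

```python
# Traditional convention: define the constant via 1 / 2.41e-4
tradition = 1 / 2.41e-4        # ~4149.3776 s MHz^2 cm^3 / pc
script = 4149.0                # truncated value used in the scripts
frac_diff = (tradition - script) / tradition
# frac_diff is roughly 9e-5, i.e. about 0.01 per cent
```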

Integrating Multi-Channel Data with Corrupt Channel

I am currently having a small problem which I think Scintillometry cannot handle at the moment.

I have baseband data recorded as vdif, with 8 separate channels. I would like to use scintillometry to create dynamic spectra and folds. However, the last channel is corrupted and is missing almost all of its data. The first seven channels are completely fine, however, and I would like to use the Fold class on just these.

I will roughly illustrate the procedure I followed.

First, I tried using the GetItem class from scintillometry.shaping:
fh = vdif.open(fdir + fname, 'rs')
gih = GetItem(fh, slice(0, 6))

I then make an integration object like this:
integrator = integration.Fold(gih, n_phase=1024, phase=psr_polyco, step=10*u.s, average=True)

However, if I try something like
integrator.read(count=1)

I get an error message (attached) which says it cannot find all of the frames, likely because of the eighth channel. I have also tried using different slices, to no avail. My hope was that the GetItem class would allow me not to have to worry about the missing frames in the last channel, but it appears this is not the case.

Currently, I am planning on just separating the channels into separate files, which may be a bit tedious. However, if there is an easier way to avoid this problem, that would be amazing.

error.txt

Pyfftw classes use lots of memory

Inspired by @danasimard's comment that scintillometry uses more and more memory, leading one to need to restart python, a quick investigation:

from scintillometry.fourier import get_fft_maker
FFT = get_fft_maker()
fft = FFT(shape=(300000000,), dtype='f8')
# memory use goes up a lot.
fft = FFT(shape=(310000000,), dtype='f8')
# Now even more
del fft
# No change
import gc
gc.collect()
# small again

Apparently, they have internal arrays; data may well get copied into those, so worth checking out!

Also, the fact that del fft does not free the memory while gc.collect() does suggests there are circular references inside.
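The del-does-nothing / gc.collect-frees pattern is the classic signature of a reference cycle; a minimal illustration:

```python
import gc

class Node:
    """Trivial object used to build a reference cycle."""
    pass

a, b = Node(), Node()
a.other, b.other = b, a   # cycle: a -> b -> a
del a, b                  # refcounts never drop to zero...
n = gc.collect()          # ...so only the cycle collector frees them
```

If the pyfftw wrapper objects hold such cycles together with their large internal arrays, memory would indeed persist until a collection runs.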

Add sideband "0"

In PSRFITS SUBINT HDUs, the frequency is labeled at the center of the frequency band. We can shift it to the upper or lower edge. But should we add something like a sideband 0, which indicates it is the center of the band?

Unclear effect of keyword arguments in Polyco.phasepol() method

There's unexplained behavior for the Polyco.phasepol() method.

There is a keyword convert, which can be either True or False. It is False by default; however, the module docstring recommends:

For use with folding codes with times since some start time t0 in seconds:
>>> psr_polyco.phasepol(t0, 'fraction', t0=t0, time_unit=u.second, convert=True)

with no explanation of why convert=True is necessary. Testing on my part shows no measurable effect of this keyword as far as our use case is concerned. Why is the default value different from the implied "recommended" usage in the main docstring? Why is the keyword available at all if it does not affect the use case we are interested in?

Then there are the arguments t0 and time_unit, which are also undocumented. An explanation of how t0 works and how to use it correctly would be useful, since the usage example simply sets it to the same value as the primary argument, which isn't very instructive.

PINT and Polyco inconsistent in test

@luojing1211 - it turns out we were not actually testing the PINT phases at all - see #96, where I turned it on again. But the tests reveal that the polyco and PINT are actually inconsistent. My guess is that it is the polyco file that is the problem (my Predictor routine returns phases consistent with it), but I'm not sure. Could you have a look? Obviously, we should fix this...

See https://github.com/mhvk/scintillometry/blob/master/scintillometry/tests/test_phases.py#L84

Make new phase utilities useful for integration

Right now the phase utilities output 2-part phases, with an integer and a fractional part. But the phase callable used for integration is expected to produce a single value.

Possible solutions:

  1. Have an option to output just a single value
  2. Create a new Phase class similar to Time that holds the two values, and which can be subtracted, etc., giving a DeltaPhase (also two-part), but with the option to convert to a Quantity with units of cycle.

Allow writing and reading of intermediate pipeline steps

Would write with a numpy handle, but need to keep the handle attributes (start_time, etc.). Maybe something simpler than FITS? Can we use np.savez (which saves several arrays in a single file)? Though we want to use a memmap for the main data array.
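A minimal sketch of the np.savez idea, storing the data plus a couple of handle attributes (attribute names here are illustrative); note that loading an .npz does not memmap the main array, which is the drawback mentioned above:

```python
import os
import tempfile

import numpy as np

data = np.zeros((1000, 4))
path = os.path.join(tempfile.mkdtemp(), 'step.npz')
# Scalars and strings are stored as 0-d arrays alongside the data.
np.savez(path, data=data,
         start_time='2010-11-12T13:14:15', sample_rate_hz=400e6)
with np.load(path) as f:
    restored = f['data']
    start_time = str(f['start_time'])
```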

Dedispersion is not correct at small DM values

The current Dedispersion/Dispersion class does not give the correct result for a very small DM value (~0.01) if samples_per_frame is small as well. The current workaround is to give a big enough samples_per_frame value.

Move flake8 settings to .travis.yml

Should move the flake8 items to be ignored into .travis.yml, rather than having them in setup.cfg, since otherwise we do not get warned in emacs either.

Should underlying handles be closed?

Currently, if I close a stream, all underlying filehandles are closed as well. This may be a problem if the same data is being fed through multiple pipelines. One could either not close underlying streams, or have some kind of reference counter (perhaps just the python one) checked before closing oneself.

TimeShifter and TimeOffset tasks

Related but different in implementation:

  • Correct for clock offsets - this really is just a difference in start time. -> #104
  • Interpolate between samples -> #117
  • Shift & interpolate and apply corresponding phase shift -> #225

The most general case would correct for earth/pulsar motion during an observation (using a polyco or PINT), i.e., also stretch time and (Doppler) shift frequency.

Specific use cases:

  • Align two data streams to subsample precision for direct correlation. Implies offset of start_time + residual sample shift, plus taking into account associated phase changes.
  • Read out a giant pulse starting with its pulse peak, determined to subsample precision, falling on an integer sample. Easiest might be to start at a specific time: this implies a new start_time plus a residual sample shift. Alternatively, call it TimeSeek(ih, time), which calculates the required sub-sample offset, initializes a TimeShift, and then seeks to the requested time. This could also be a classmethod on TimeShift. EDIT: ended up just being easier as Resample
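The residual (sub-integer) sample shift can be sketched with a Fourier-domain phase ramp, assuming a real-valued, effectively periodic chunk:

```python
import numpy as np

def subsample_shift(x, shift):
    """Delay a real time series by `shift` samples (may be fractional)
    via a phase ramp in the Fourier domain (illustrative sketch)."""
    n = len(x)
    f = np.fft.rfftfreq(n)   # frequencies in cycles per sample
    X = np.fft.rfft(x)
    return np.fft.irfft(X * np.exp(-2j * np.pi * f * shift), n)
```

A real Resample task would also need to handle the associated phase rotation of complex baseband data and edge effects between frames.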

Give frequency via the function task

Do we support setting the frequency on a downstream task? Currently, we do not:

import os

import numpy as np
import astropy.units as u
from baseband import vdif
from scintillometry.functions import Square

data_dir = '/hdd/vidf_data/test_vidf/0001'
files = [os.path.join(data_dir, '0001{0:03d}.vdif'.format(n)) for n in range(50)]
ih = vdif.open(files, sample_rate=400 * u.MHz / 1024)
sq = Square(ih)
sq.frequency = np.linspace(400, 800, 1024)

AttributeError Traceback (most recent call last)
in ()
1 sq = Square(ih)
----> 2 sq.frequency = np.linspace(400, 800, 1024)

AttributeError: can't set attribute

Using Scintillometry on numPy arrays

More of an inquiry than an issue:

If I have a waterfall of frequency and time as a numpy array, together with a polyco file, would it be possible to use Scintillometry to obtain a folded numpy array? Currently, the Fold class only takes baseband stream readers.

broadcast frequency

This is related to PR #74.
PSRFITS data is organized as (nsample, nbin, nchan, npol). When I input the frequency as a 1-d array, it fails the broadcast check, since an (nchan,) array cannot broadcast to (nbin, nchan, npol). I need to reshape the frequency to (1, nchan, 1). Should I force frequency to (1, nchan, 1)?
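The failure and the fix can be seen directly with plain numpy (the shapes below are examples only):

```python
import numpy as np

nbin, nchan, npol = 512, 64, 2
freq = np.linspace(400.0, 800.0, nchan)    # shape (nchan,)
sample = np.zeros((nbin, nchan, npol))

# (nchan,) lines up with the *last* axis, npol, so broadcasting fails;
# reshaped to (1, nchan, 1) it lines up with the channel axis instead.
ok = freq.reshape(1, nchan, 1) * sample
```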

Replace part of TaskBase.__init__ with a helper function

Currently, it tries to be very clever with input and output sample rates to determine frames_per_sample, _raw_frames_per_sample, etc. This is perhaps too much magic, so should we replace it with clear helper functions for specific purposes?

Folding very slow compared to Stack/Integrate

Mostly as a reminder. EDIT: the reason was thought to be the use of np.add.at (it is not the problem).

Edit: comparing speeds on Arecibo J1810 data with 4 channels sampled at 3.125 MHz:

ch = Channelize(dedispersed, 325)
square = Square(ch)
fold = Fold(square, nbin, polyco, step=1*u.s)
fold.read(1) # -> 45 s
stack = Stack(square, 16, polyco)
intg = Integrate(stack, 601)  # Also 1 s
intg.read(1)  # -> 7.5 s

For folding, about half the time (25 s, with profiling on) is spent inside Predictor, calling it 9615 times, once for each spectrum from ch, while for stack+integrate only 1179 calls are made (2.8 s, generally with 16 entries per call). So the main problem is that in folding a call is made for each spectrum; we need a larger samples_per_frame in ch.

Indeed, with that, the time to fold is reduced to 4.0 s (and integrate to 5.8 s).

So, this is mostly a documentation issue...

Error When Using Fold Class

Recently, I attempted to use the new Scintillometry version, and it appears I have run into a bug. I have attached my files, but I will also describe the problem here:

Initially, the data are stored as baseband in a multi-threaded mark4 format. In total, each mark4 file has 8 threads, with four frequency sub-bands and 2 polarizations. The four frequency bands are [312.25 - 320.25, 320.25 - 328.25, 328.25 - 336.25, 336.25 - 344.25] MHz, and the polarizations are [R, L]. So, for instance, in each mark4 file the first thread is (freq1, R), the second thread is (freq1, L), the third thread is (freq2, R), and so forth.

Initially, I run my file fold.py. In this file, I use the scintillometry reshape class to reshape the streamer as follows:

rh = Reshape(fh, (4, 2))

Then, I call a Fold class that I defined in scintillometry_reduction (different than your scintillometry fold class) in the line

WF = sr.Fold(rh, dispersion_measure, frequency, sideband, polyco_file, polarization, fullpol, nthreads)

In this class, the error is thrown after I use the line which calls the Fold class in Scintillometry:

integrator = integration.Fold(power, n_phase=1024, phase=psr_polyco, step=self.interval, average=False)

where self.interval is 4 seconds (using astropy units) and is the sub-integration time, psr_polyco is a polyco I generated with tempo2, and power is calculated from the lines above.

Essentially, Fold calls the superclass init method of Integration, which calls the superclass init method of BaseTaskBase, which calls the superclass init method of Base, which calls the method _check_frequency, which tries (and fails) to broadcast the frequency array to the sample shape of the data. My suspicion is that somewhere in the Fold class the shape is miscalculated. I have attached the full error message in error.txt. For some reason, it tries to broadcast the frequency array to the shape (1, 4, 4); I am not sure where this number comes from, as the sample shape of the reader is (4, 2) and the shape of both the sideband and frequency arrays is (1, 4, 2).

Thanks!

error.txt
scintillometry_reduction.txt
fold.txt

Allow Phase to input from and output to long strings

Can just be like DADA start times: split on '.' and read the two parts separately. Similarly, output could be constructed from the two parts.

To be decided: for creation from a string, should we have a class method, or adapt the initializer? In either case: do we allow arrays of strings? For the initializer, how about allowing both phase1 and phase2? We don't want to make the common case very slow to support this.

EDIT: the get-from-string would also be useful inside Predictor for the phase zero point.
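For the non-negative case, the split itself is simple; a hypothetical sketch (split on '.', as for DADA start times):

```python
def phase_from_string(s):
    """Split a long decimal string into an integer-cycle part and a
    float fraction without losing precision (non-negative values only)."""
    int_part, _, frac_part = s.partition('.')
    frac = float('0.' + frac_part) if frac_part else 0.0
    return int(int_part), frac
```

The reverse direction would format the two parts separately and concatenate them; negative phases and string arrays would need extra care.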

Allow array of frequencies for phases.PintPhase

I have a waterfall from a PSRFITS file of shape (frequencies, phase) and wish to compute the new phases for each frequency from a .par file. Using an array of frequencies currently throws a ValueError.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Idea: generate fake pulsar from PSRFITS folded profile

In general, we'd like to be able to simulate fake pulsars, but there are lots of settings. One way to give all those is to pass on a PSRFITS folded profile, and use, e.g., the frequency and time settings stored in that file.

Move io.psrfits tests to their own directory

While it seems logical to have all tests of pulsar tasks in scintillometry.tasks, the io tests are really rather stand-alone and may as well be moved (after all, psrfits might make sense as part of baseband!).

Decide how to pass on "count" (or "weight"?)

We briefly had a BaseData subclass of ndarray with a count attribute (#44), but this was rather problematic, so for now we are using a structured dtype with 'data' and 'count' entries (#38). Should think if that is really what we want.

Revise source code tree

The current file structure, with folders for individual tasks, is overelaborate. It makes far more sense to have a structure like:

scintillometry/
	tasks/
		base.py
		channelize.py
		dedisperse.py <- dm.py contents go in here
		integrate.py
	fourier/
		base.py
		numpy.py
		pyfftw.py	

docs/
	tasks/
		index.rst <- covers all tasks
	fourier/
		index.rst

task.close() should have an option that does not close the underlying task

The current .close() setup closes the underlying task. Could we add an option so that it does not close the underlying tasks? For instance, I have several DM-correction tasks that take the same Dedispersion instance as input; if I close one correction task, it deletes the shared Dedispersion instance. We could have a flag to avoid this. Maybe I could just delete the correction task instead, but I am not sure that is the best way to do it.

TypeError when simulating data

I am working on simulating some simple pulsar data, and I got a TypeError when doing the following:

import numpy as np
import astropy.units as u
from astropy import time
# (plus EmptyStreamGenerator and Task from scintillometry)

start_time = time.Time('2010-11-12T13:14:15')
sample_rate = 10. * u.kHz
shape = (5000, 2, 4)
nh = EmptyStreamGenerator(shape=shape,
                          start_time=start_time,
                          sample_rate=sample_rate,
                          samples_per_frame=200, dtype=int)

F0 = 1.0 / (10 * u.ms)
F1 = -1e-7 * (u.Hz/u.s)

def phase(dt, F0, F1):
    dt = dt.to(u.s)
    return F0 * dt + F1 * (dt)**2

def pulse_simulate(nh, data):
    t = nh.tell() / nh.sample_rate
    ph = phase(t, F0, F1)
    data = (10 if np.isclose(ph % 1, 0, atol=1e-4) else 0.125)
    return data

mt = Task(nh, pulse_simulate, samples_per_frame=2)

mt.seek(0)
mt.read(1)

The error message is:
TypeError Traceback (most recent call last)
in ()
----> 1 mt.read(1)

~/.local/lib/python3.6/site-packages/scintillometry-0.0.dev98-py3.6.egg/scintillometry/base.py in read(self, count, out)
253 self._frame_index = frame_index
254
--> 255 nsample = min(count, len(self._frame) - sample_offset)
256 data = self._frame[sample_offset:sample_offset + nsample]
257 # Copy to relevant part of output.

TypeError: object of type 'int' has no len()

Is there anything I am doing wrong?
