mhvk / baseband-tasks
A package for radio baseband data reduction and analysis.
Home Page: https://baseband.readthedocs.io/projects/baseband-tasks
License: GNU General Public License v3.0
Apparently, something like
pip3 install git+https://github.com/matplotlib/matplotlib.git
is possible (see https://github.com/astropy/astropy/pull/9726/files#diff-1d37e48f9ceff6d8030570cd36286a61R99) - we could use this to simplify our .travis.yml setup.
I am having trouble reading data while using scintillometry and baseband. I am not sure whether the error is in one or the other, but I suspect it is the former.
Note that with previous versions of both packages, my code ran fine on the same dataset.
My software basically wraps the functions for reading and changing shape (ChangeSampleShape); these are:
def telescope_shape(fh, telescope):
    """
    Reshape data for ARO or CHIME with a file handler. This is the standard
    way to transform data to V[time, frequency, polarization]. ``A = CHIME``
    and ``B = ARO``.

    Parameters
    ----------
    fh : `baseband.vdif.base.VDIFStreamReader`
        File handler from `baseband`.
    telescope : `str`
        Telescope name: ``'CHIME'`` or ``'ARO'``.

    Returns
    -------
    fh_reshaped : `baseband.vdif.base.VDIFStreamReader`
        File handler with the new shape.
    """
    def reshape_chime(data):
        return data.reshape(-1, 256, 4, 2).transpose(
            0, 2, 1, 3).reshape(-1, 1024, 2)

    def reshape_aro(data):
        return data.swapaxes(1, 2)

    if telescope == 'CHIME' or telescope == 'A':
        fh_reshaped = ChangeSampleShape(fh, reshape_chime)
    elif telescope == 'ARO' or telescope == 'B':
        fh_reshaped = ChangeSampleShape(fh, reshape_aro)
    else:
        raise ValueError("telescope must be 'CHIME'/'A' or 'ARO'/'B'.")
    return fh_reshaped
and
def read_ntime(
        data_sorted, telescope, timestamp_i, ntime,
        freq_resolution=390625 * u.Hz, verify=True):
    """
    Given an initial timestamp, ``timestamp_i``, read the ``ntime``
    samples that follow.

    Parameters
    ----------
    data_sorted : `list`
        List of strings with the corresponding paths of files to be opened.
        Files must be sorted chronologically.
    telescope : `str`
        Telescope name: ``'CHIME'`` or ``'ARO'``.
    timestamp_i : `astropy.time.core.Time`
        Initial timestamp from which to start reading.
    ntime : `int`
        Number of time samples of `2.56e-6` s to be read.
    freq_resolution : `astropy.units.Quantity`
        Frequency resolution, sample rate or frequency sampling in
        :math:`\\mathrm{Hz}`.
    verify : `bool`
        Whether to do consistency checks on frames being read. See the
        `baseband` reference. Default is `True`.

    Returns
    -------
    V : `np.ndarray`
        Three-dimensional array, shaped as ``V[ntime, frequency,
        polarization]``.
    limits : `list`
        List containing the `astropy.time.core.Time` start and stop of
        `data_sorted`.
    """
    fraw = helpers.sequentialfile.open(data_sorted, 'rb')
    fh = vdif.open(fraw, mode='rs', sample_rate=freq_resolution,
                   verify=verify)
    fh = telescope_shape(fh, telescope=telescope)
    fh.seek(timestamp_i)
    # Update the timestamp with baseband.
    timestamp_i_corrected = fh.tell(unit='time')
    V = fh.read(ntime)
    # Clip the final timestamp to the end of the stream.
    timestamp_f = fh.tell(unit='time')
    if timestamp_f > fh.stop_time:
        timestamp_f = fh.stop_time
    fh.close()
    limits = [timestamp_i_corrected, timestamp_f]
    return V, limits
While running my GP search I encountered the same error over and over, until it had run through the whole data set.
Traceback (most recent call last):
File "/home/v/vanderli/cassane/babendil-frb-vlbi/scripts/gp_run_search.py", line 184, in <module>
gp_thres=gp_thres
File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/sigprocess/gp_search.py", line 298, in run
V_cdd, limits = self.process_file(timestamp, ntime)
File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/sigprocess/gp_search.py", line 207, in process_file
verify=False
File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/babendil-0.0.1-py3.7.egg/babendil/io/reader.py", line 172, in read_ntime
V = fh.read(ntime)
File "/home/v/vanderli/cassane/anaconda3/lib/python3.7/site-packages/scintillometry-0.0.dev394-py3.7.egg/scintillometry/base.py", line 298, in read
raise EOFError("cannot read from beyond end of input.")
EOFError: cannot read from beyond end of input.
Some other comments: the reader is opened with verify=False.
It seems to make little sense to have to import the tasks from a given module, except perhaps helpers and io. Just import them all at the top level?
Would be useful both to produce input for, say, correlation, and to combine, e.g., multiple channels from a single source split over multiple files.
Would be nice to be able to apply/correct rotation measure as one does for dispersion. From an e-mail by Dongzi (recipe slightly edited):
Here is the algorithm to correct the cable delay and RM for linear voltage data. It is straightforward to test: just form the Stokes parameters with the corrected X and Y, and see if all the effects are removed. With the same parameters, if you are able to calibrate the Stokes parameters, you are able to calibrate X and Y. To do it more rigorously, there should be an algorithm to calculate the Jones matrix given two calibration sources, which would also be straightforward to apply (probably ask Luke for this).
X = X * exp(-i tau f), where tau is the cable delay and f is the frequency
L, R = 0.5 * (X ± i Y) (the definition of L and R might be opposite)
L = L * exp(-i RM lambda^2), R = R * exp(i RM lambda^2)
X = L + R, Y = (L - R) i
so the combined Jones matrix for the above steps will be
J = [[exp(i tau f) cos(RM lambda^2), -sin(RM lambda^2)],
     [exp(-i tau f) sin(RM lambda^2), -cos(RM lambda^2)]]
and J dot [X, Y] would give you the corrected X and Y.
But I highly recommend breaking this into steps and forming the Stokes parameters at each step to see the effect on the first burst. I might be missing some constant phase, or the convention for RM/cable delay may be off by a sign, etc. But I think you are very smart, so this will be easy for you to find out.
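As a rough sketch of how such a correction could start out, the recipe above translates into numpy as follows. Everything here is an assumption subject to the caveats in the e-mail: the function name is hypothetical, RM * lambda^2 is taken as a precomputed per-channel rotation angle, and the phase conventions (including a possible factor of 2*pi in the delay term) may need flipping.

```python
import numpy as np

def correct_cable_delay_and_rm(X, Y, f, tau, rm_lambda2):
    """Sketch of the cable-delay + RM recipe above (conventions assumed).

    X, Y are complex linear-polarization voltages, f the frequency in Hz,
    tau the cable delay in s, and rm_lambda2 the rotation RM * lambda**2
    in rad (precomputed per frequency channel).
    """
    # Step 1: remove the cable delay from X (a factor of 2*pi may be
    # needed in the exponent, depending on convention).
    X = X * np.exp(-1j * tau * f)
    # Step 2: form circular polarizations (L/R definition may be swapped).
    L = 0.5 * (X + 1j * Y)
    R = 0.5 * (X - 1j * Y)
    # Step 3: derotate by the rotation measure.
    L = L * np.exp(-1j * rm_lambda2)
    R = R * np.exp(1j * rm_lambda2)
    # Step 4: back to linear polarizations; the e-mail writes Y = (L - R) i,
    # but with the definitions above the inverse is (L - R) / i.
    return L + R, (L - R) / 1j
```

With tau = 0 and rm_lambda2 = 0 this reduces to the identity, which is a useful sanity check before chasing sign conventions on real data.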
cc @theXYZT
As raised by @luojing1211 per e-mail: Once one has set up a pipeline, how to actually get it to write out its results? And perhaps also write intermediate steps?
One possibility, at least for the final step, is to create a memory-mapped output and then use that for the out argument in read. E.g., currently, the following works:
from numpy.lib.format import open_memmap
...
stack = Stack(square, nbin, phase)
mm = open_memmap('stack.npy', 'w+', dtype='f4', shape=stack.shape)
stack.read(out=mm)
FITS files similarly can be opened with memmap=True, so this should work reasonably transparently. All we would need then is some functions that create the right output files, filling the headers.
See also a complete pipeline at https://gist.github.com/mhvk/65944fcc8477c7a40ef10a896bb90a56
In PR #119 I added the data writing functionality. The file handler shape is (nsample, nbin, npol, nchan), but the PSRFITS shape is (nsample, nbin, nchan, npol). So read(nsample, out=PSRFITS) will report a shape mismatch. Where should the reshape happen?
For folding, we need to return not just the folded data, but also how many samples went in. For this purpose, we perhaps should return a subclass of ndarray which holds a count attribute. This could be extended to all tasks and even to baseband (mhvk/baseband#286) to take into account missing data or data removed by, e.g., RFI clipping.
It seems reasonable for this attribute not to propagate much, if at all, i.e., most operations would delete it.
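A minimal sketch of such a subclass, using the standard numpy subclassing hooks (the class and attribute names are hypothetical); the count survives views and slices, but ufunc results deliberately drop back to a plain ndarray, matching the "does not propagate much" behaviour above:

```python
import numpy as np

class CountedArray(np.ndarray):
    """ndarray subclass carrying a `count` of contributing samples."""

    def __new__(cls, data, count=None):
        obj = np.asarray(data).view(cls)
        obj.count = count
        return obj

    def __array_finalize__(self, obj):
        # Called on views and slices; carry the count along.
        if obj is None:
            return
        self.count = getattr(obj, 'count', None)

    def __array_wrap__(self, out_arr, *args, **kwargs):
        # Drop the subclass (and hence `count`) for ufunc results.
        return np.asarray(out_arr)
```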
The real default should probably be: "fftw if available else numpy"
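Such a default could be a simple import-time fallback; a sketch (the variable name is hypothetical, and the real implementation would presumably live inside get_fft_maker):

```python
# "fftw if available, else numpy" default engine selection.
try:
    import pyfftw  # noqa: F401
    DEFAULT_FFT_ENGINE = 'fftw'
except ImportError:
    DEFAULT_FFT_ENGINE = 'numpy'
```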
Obviously, need some. Worth investigating different types. E.g., quite a bit of success by looking for non-Gaussian signals - see https://arxiv.org/abs/1903.00588
Also see https://www.aanda.org/articles/aa/full_html/2021/06/aa40164-20/aa40164-20.html and https://arxiv.org/abs/2111.14225v1
@quantumfx pointed to a paper by Pennucci (https://iopscience.iop.org/article/10.1088/0004-637X/790/2/93), which includes a discussion of the optimal reference frequency for dedispersion, which minimizes covariance between DM and time shifts:
1/f_ref**2 = sum(weight / f**2) / sum(weight)
Here, the sum is over the frequency bands involved. In principle, the weight includes the spectral shape of the pulsar and gains, etc., which may make this perhaps not so easy to use in practice, though even with constant weight, it should be better than just taking the mean frequency.
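Even with constant weights this is straightforward to compute; a small sketch (the function name is an assumption, frequencies as plain floats in any consistent unit):

```python
import numpy as np

def optimal_reference_frequency(f, weight=None):
    """Reference frequency minimizing DM/time-shift covariance (Pennucci).

    Implements 1/f_ref**2 = sum(weight / f**2) / sum(weight); with equal
    weights this is an inverse-square-weighted mean of the band.
    """
    f = np.asarray(f, dtype=float)
    if weight is None:
        weight = np.ones_like(f)
    inv_fref2 = np.sum(weight / f**2) / np.sum(weight)
    return 1.0 / np.sqrt(inv_fref2)
```

Note that for a band from 400 to 800 (in any unit) this gives a reference frequency below the mean frequency, as the lower frequencies dominate the 1/f**2 sum.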
Would it be possible for the fold class to take a start time argument?
For instance, if my baseband stream has a start time of 12:00:12.5, I might like my first time bin to start at 12:00:00 or 12:00:10. This would allow one to align the time bins between different stations and datasets, which is important if you want to measure a time offset between two stations, for instance.
Currently, I don't see any way to do this, as the Integrate class forces the start time to be the start time of the baseband stream.
It seems the dispersion delay constant used in this package is different from the one that has worked in the past for similar work.
In dm.py
you define it as:
dispersion_delay_constant = u.s / 2.41e-4 * u.MHz**2 * u.cm**3 / u.pc
where 1/2.41e-4 is interpreted as the floating-point number: 4149.377593360996
whereas in my scripts, it has been set to:
4149. * u.s * u.MHz**2 * u.cm**3 / u.pc
which is definitely different. How would we go about figuring out which value is correct? Is there a way to confirm?
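One quick check is just to quantify the discrepancy: the difference is the truncation of 1/2.41e-4 to 4149, a relative difference of about 9e-5 in the dispersion delay, which can matter for precision timing at low frequencies. A sketch, with units dropped for brevity:

```python
# Constant as defined in dm.py, in s MHz**2 cm**3 / pc:
k_dm_package = 1 / 2.41e-4        # ~4149.377593360996
# Truncated value used in the scripts mentioned above:
k_dm_scripts = 4149.0
# Relative difference of ~9e-5 (i.e., ~0.01% of the dispersion delay).
rel_diff = (k_dm_package - k_dm_scripts) / k_dm_package
```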
Right now, this gives errors because the sample_rate unit is not equivalent to Hz.
I am currently having a small problem which I think Scintillometry cannot handle at the moment.
I have baseband data recorded as VDIF, with 8 separate channels. I would like to use scintillometry to create dynamic spectra and folds. However, the last channel is corrupted and is missing almost all of its data. The first seven channels are completely fine, and I would like to use the fold class on just these.
I will roughly illustrate the procedure I followed.
First, I tried using the GetItem class from scintillometry.shaping:
fh = vdif.open(fdir + fname, 'rs')
gih = GetItem(fh, slice(0, 6))
I then make an integration object like this:
integrator = integration.Fold(gih, n_phase=1024, phase=psr_polyco, step=10*u.s, average=True)
However, if I try something like
integrator.read(count=1)
I get an error message (attached) which says it cannot find all of the frames, likely because of the eighth channel. I have also tried different slices, to no avail. My hope was that the GetItem class would allow me not to have to worry about the missing frames in the last channel, but it appears this is not the case.
Currently, I am planning on just splitting the channels into separate files, which may be a bit tedious. However, if there is an easier way to avoid this problem, that would be amazing.
Inspired by @danasimard's comment that scintillometry uses more and more memory, leading one to need to restart python, a quick investigation:
from scintillometry.fourier import get_fft_maker
FFT = get_fft_maker()
fft = FFT(shape=(300000000,), dtype='f8')
# memory use goes up a lot.
fft = FFT(shape=(310000000,), dtype='f8')
# Now even more
del fft
# No change
import gc
gc.collect()
# small again
Apparently, the FFT objects have internal arrays - data may well get copied into those, so worth checking out!
Also, the fact that del fft doesn't do much while gc.collect() actually frees the memory suggests some circular references inside.
In PSRFITS SUBINT HDUs, the frequency is labeled at the center of the frequency band. We can shift it to the upper or lower edge. But should we add something, like a 0, that indicates it is the center of the band?
There's unexplained behavior for the Polyco.phasepol() method.
There is a keyword "convert" which can be either True or False. It is False by default; however, the module docstring recommends:
For use with folding codes with times since some start time t0 in seconds:
>>> psr_polyco.phasepol(t0, 'fraction', t0=t0, time_unit=u.second, convert=True)
with no explanation of why convert=True is necessary. Testing on my part shows that this keyword has no measurable effect as far as our use case is concerned. Why is the default value different from the implied "recommended" usage in the main docstring? Why is the keyword available at all if it doesn't affect the use case we are interested in?
Then, there are the arguments t0 and time_unit, which are also undocumented. Knowing how t0 works and how to use it correctly would be useful, since the usage example simply sets it to the same value as the primary argument, which isn't very instructive.
@luojing1211 - it turns out we were not actually testing the PINT phases at all - see #96, where I turned it on again. But the tests reveal that the polyco and PINT are actually inconsistent. My guess is that it is the polyco file that is the problem (my Predictor routine returns phases consistent with it), but I'm not sure. Could you have a look? Obviously, we should fix this...
See https://github.com/mhvk/scintillometry/blob/master/scintillometry/tests/test_phases.py#L84
Currently, the docs are failing to build because of a pyfftw/scipy issue. This also means one cannot use FFTW if scipy 1.4 is present. It seems this is about to be fixed, but if not, we can temporarily pin scipy to 1.3.
I encountered this error when using the dedispersion module, but I have never seen it before. @mhvk have you seen this error before? Do I have to update my FFTW?
Right now the phase utilities output 2-part phases, with an integer and a fractional part. But the phase callable used for integration is expected to produce a single value.
Possible solution: a Phase class similar to Time that holds the two values, and which can be subtracted, etc., giving a DeltaPhase (also two-part), but with the option to convert to a Quantity with units of cycle.
A separate question is how to write output: would write as numpy, but need to keep the handle attributes (start_time, etc.). Maybe something simpler than FITS? Can we use np.savez? (It saves several arrays in a single file.) Though we want to use a memmap for the main data array.
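A minimal sketch of the two-part idea (class and attribute names are hypothetical), keeping precision by renormalizing so the fractional part stays small, in the same spirit as Time's jd1/jd2:

```python
import numpy as np

class TwoPartPhase:
    """Minimal sketch of a two-part (integer + fractional) phase."""

    def __init__(self, int_part, frac_part=0.):
        int_part = np.asarray(int_part, dtype=float)
        frac_part = np.asarray(frac_part, dtype=float)
        # Renormalize so that |frac| <= 0.5.
        extra = np.round(frac_part)
        self.int = int_part + extra
        self.frac = frac_part - extra

    def __sub__(self, other):
        # Differences keep the two parts separate, so large integer
        # cycle counts cancel exactly.
        return TwoPartPhase(self.int - other.int, self.frac - other.frac)

    @property
    def cycle(self):
        # Lossy conversion to a single value in cycles.
        return self.int + self.frac
```

The payoff is in differences of large phases: (1e15 + 0.1) - 1e15 in a single float64 loses most digits of the 0.1, while the two-part difference returns it exactly.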
Somehow, the version file seems to have gone missing.
The current Dedispersion/Dispersion class does not give the correct result for a very small DM value (~0.01) if samples_per_frame is small as well. The current workaround is to give a big enough samples_per_frame value.
Should move flake8 items that should be ignored to .travis.yml, rather than have them in setup.cfg, since otherwise we don't get warned in emacs either.
Currently, if I close a stream, all underlying filehandles are closed as well. This may be a problem if the same data is being fed through multiple pipelines. One could either not close underlying streams, or have some kind of reference counter (perhaps just the python one) checked before closing oneself.
Partially seems to result from how table is used in Predictor, but some of it may be PINT-related.
Right now, if dedispersing to a reference frequency outside of the band, the code just adds a time offset. It would make sense to ensure that that offset corresponds to an integer number of samples.
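Rounding the offset to an integer number of samples while keeping track of the sub-sample residual could look like the following sketch (the function name is hypothetical; plain floats in seconds and Hz):

```python
def integer_sample_offset(time_offset, sample_rate):
    """Round a dedispersion time offset to a whole number of samples.

    Returns the integer sample shift and the residual sub-sample offset
    (to be applied separately, or dropped if the offset is cosmetic).
    """
    n = round(time_offset * sample_rate)
    residual = time_offset - n / sample_rate
    return n, residual
```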
Related but different in implementation:
The most general case would correct for earth/pulsar motion during an observation (using a polyco or PINT), i.e., also stretch time and (Doppler) shift frequency.
Specific use cases:
TimeSeek(ih, time), which calculates the required sub-sample offset, initializes a TimeShift, and then seeks to the requested time. This could also be a classmethod on TimeShift. EDIT: ended up just being easier as Resample.
Do we support setting the frequency on a downstream task? Currently, we do not:
import os
import numpy as np
import astropy.units as u
from baseband import vdif
from scintillometry.functions import Square

data_dir = '/hdd/vidf_data/test_vidf/0001'
files = [os.path.join(data_dir, '0001{0:03d}.vdif'.format(n)) for n in range(50)]
ih = vdif.open(files, sample_rate=400 * u.MHz / 1024)
sq = Square(ih)
sq.frequency = np.linspace(400, 800, 1024)
AttributeError Traceback (most recent call last)
in ()
1 sq = Square(ih)
----> 2 sq.frequency = np.linspace(400, 800, 1024)
AttributeError: can't set attribute
More of an inquiry than an issue:
If I have a waterfall of frequency and time as a numpy array and a polyco file, would it be possible to use Scintillometry to obtain a folded numpy array? Currently, the Fold class only takes baseband stream readers.
Or at least that there is no frequency without sideband, as that would break calculations.
This is related to PR #74.
The PSRFITS data shape is organized as (nsample, nbin, nchan, npol). When I input the frequency as a 1-d array, it fails at the broadcast check, since an (nchan,) array cannot broadcast to (nbin, nchan, npol). I need to reshape the frequency to (1, nchan, 1). Should I force frequency to (1, nchan, 1)?
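The broadcasting behaviour can be checked directly with numpy: a bare (nchan,) array lines up with the trailing npol axis, so a reshape is indeed needed. A small sketch (the shapes and frequency values below are made up for illustration):

```python
import numpy as np

nbin, nchan, npol = 256, 4, 2
freq = np.linspace(1100., 1400., nchan)  # hypothetical channel frequencies

# A bare (nchan,) array aligns with the trailing npol axis and fails:
try:
    np.broadcast_to(freq, (nbin, nchan, npol))
except ValueError:
    pass  # cannot broadcast (4,) to (256, 4, 2)

# A trailing length-1 axis makes it line up with the channel axis:
broadcast = np.broadcast_to(freq[:, np.newaxis], (nbin, nchan, npol))
```

So (nchan, 1) is the minimal shape that works; (1, nchan, 1) is equivalent after broadcasting against the sample shape.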
Currently, it tries to be very clever with input and output sample rates to determine frames_per_sample, _raw_frames_per_sample, etc. But perhaps too much magic, so replace with clear helper functions for specific purposes?
Mostly as a reminder. EDIT: the reason was suspected to be the use of np.add.at, but that turned out not to be the problem.
Edit: comparing speeds on Arecibo J1810 data with 4 channels sampled at 3.125 MHz:
ch = Channelize(dedispersed, 325)
square = Square(ch)
fold = Fold(square, nbin, polyco, step=1*u.s)
fold.read(1) # -> 45 s
stack = Stack(square, 16, polyco)
intg = Integrate(stack, 601) # Also 1 s
intg.read(1) # -> 7.5 s
For folding, about half the time (25 s, with profiling on) is spent inside Predictor, calling it 9615 times, once for each spectrum from ch, while for stack+integrate, 1179 calls are made (2.8 s) (generally with 16 entries per call). So, the main problem is that in folding a call is made for each spectrum -> need a larger samples_per_frame in ch.
Indeed, with that, the time to fold reduces to 4.0 s (and integrate to 5.8 s).
So, this is mostly a documentation issue...
The task base perhaps should have __enter__ and __exit__ methods, so they look even more like file handles.
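This only needs the standard context-manager protocol; a sketch against a stand-in base class (the real task base would already have a close method that closes the underlying stream):

```python
class TaskBase:
    """Stand-in sketch of adding context-manager support to tasks."""

    def close(self):
        # Placeholder; the real method closes the underlying stream.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

With this, tasks can be used like file handles, e.g. `with Square(fh) as sq: ...`, with close() guaranteed on exit.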
In the DedispersionMeasure class, it would be nice to have functionality to estimate the DM from a given time delay and its frequencies.
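This is just the inverse of the usual dispersion-delay relation; a sketch using the constant from dm.py (the function name is hypothetical, units dropped for brevity):

```python
def dm_from_delay(dt, f1, f2, k_dm=1 / 2.41e-4):
    """Estimate a dispersion measure from a measured time delay.

    Inverts dt = k_dm * DM * (f1**-2 - f2**-2), with dt in s, f1 and f2
    in MHz, and the result in pc / cm**3 (k_dm in s MHz**2 cm**3 / pc).
    """
    return dt / (k_dm * (f1**-2 - f2**-2))
```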
Recently, I have attempted to use the new Scintillometry version, and it appears I have run into a bug. I have attached my files, but I will also describe the problem here:
Initially, the data is stored as baseband in a multi-threaded mark4 format. In total, each mark4 file has 8 threads, with four frequency sub-bands and 2 polarizations. The four frequency bands are [312.25 - 320.25, 320.25 - 328.25, 328.25 - 336.25, 336.26 - 344.25] MHz, and the polarizations are [R, L]. So, for instance, in each mark4 file, the first thread is (freq1, R), the second thread is (freq1, L), the third thread is (freq2, R), and so forth.
Initially, I run my file fold.py. In this file, I use the scintillometry reshape class to reshape the streamer as follows:
rh = Reshape(fh, (4, 2))
Then, I call a Fold class that I defined in scintillometry_reduction (different from your scintillometry fold class) in the line
WF = sr.Fold(rh, dispersion_measure, frequency, sideband, polyco_file, polarization, fullpol, nthreads)
In this class, the error is thrown after I use the line which calls the Fold class in Scintillometry:
integrator = integration.Fold(power, n_phase=1024, phase=psr_polyco, step=self.interval, average=False)
where self.interval is 4 seconds (using astropy units) and is the sub-integration time, psr_polyco is a polyco I generated with tempo2, and power is calculated from the lines above.
Essentially, what occurs is Fold calls the superclass init method for Integration which calls the super class init method for BaseTaskBase which calls the super class init method for Base which calls the method _check_frequency which tries (and fails) to broadcast the frequency axis to the same shape as the sample shape of the data. My suspicion is that somewhere in the Fold class, the shape is miscalculated. I have attached the full error message in error.txt. For some reason, it tries to broadcast the frequency array to the shape (1,4,4). However, I am not sure where this number is coming from, as the sample shape of the reader is (4,2) and the shape of both the sideband and frequency arrays is (1, 4, 2).
Thanks!
linear -> circular; see https://github.com/theXYZT/pulsarbat/blob/master/pulsarbat/transforms.py#L135
Needs big warning that gains have to be solved for first.
Can just be like DADA start times: split on '.' and read the two parts separately. Similarly, output could be constructed from the two parts.
To be decided: for creation from a string, have a class method, or adapt the initializer? For either case: do we allow arrays of strings? For the initializer, how about giving both phase1 and phase2? We don't want to make the common case very slow to support this.
EDIT: the get-from-string would also be useful inside Predictor for the phase zero point.
I have a waterfall from a PSRFITS file of shape (frequencies, phase) and wish to compute the new phases for each frequency from a .par file. Using an array of frequencies currently throws a ValueError.
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In general, we'd like to be able to simulate fake pulsars, but there are lots of settings. One way to give all those is to pass on a PSRFITS folded profile, and use, e.g., the frequency and time settings stored in that file.
While it seems logical to have all tests of pulsar tasks in scintillometry.tasks, the io parts are really rather stand-alone and may as well be moved (after all, psrfits might make sense as part of baseband!).
The current file structure, with folders for individual task, is overelaborate. It makes far more sense to have a structure like:
scintillometry/
tasks/
base.py
channelize.py
dedisperse.py <- dm.py contents go in here
integrate.py
fourier/
base.py
numpy.py
pyfftw.py
docs/
tasks/
index.rst <- covers all tasks
fourier/
index.rst
The current .close() setup closes the underlying task. Could we add an option so that it does not also close the underlying tasks? For instance, I have several DM-correction tasks that take the same Dedispersion class as input; if I close one correction task, it will close the initial Dedispersion class. We could have a flag to avoid this. Maybe I can just delete the correction task instead, but I am not sure whether this is the best way to do it.
I am working on simulating some simple pulsar data. I got a TypeError when doing the following:
import numpy as np
import astropy.units as u
from astropy import time
from scintillometry.generators import EmptyStreamGenerator
from scintillometry.base import Task

start_time = time.Time('2010-11-12T13:14:15')
sample_rate = 10. * u.kHz
shape = (5000, 2, 4)
nh = EmptyStreamGenerator(shape=shape,
                          start_time=start_time,
                          sample_rate=sample_rate,
                          samples_per_frame=200, dtype=int)
F0 = 1.0 / (10 * u.ms)
F1 = -1e-7 * (u.Hz / u.s)

def phase(dt, F0, F1):
    dt = dt.to(u.s)
    return F0 * dt + F1 * dt**2

def pulse_simulate(nh, data):
    t = nh.tell() / nh.sample_rate
    ph = phase(t, F0, F1)
    data = (10 if np.isclose(ph % 1, 0, atol=1e-4) else 0.125)
    return data

mt = Task(nh, pulse_simulate, samples_per_frame=2)
mt.seek(0)
mt.read(1)
The error message is:
TypeError Traceback (most recent call last)
in ()
----> 1 mt.read(1)
~/.local/lib/python3.6/site-packages/scintillometry-0.0.dev98-py3.6.egg/scintillometry/base.py in read(self, count, out)
253 self._frame_index = frame_index
254
--> 255 nsample = min(count, len(self._frame) - sample_offset)
256 data = self._frame[sample_offset:sample_offset + nsample]
257 # Copy to relevant part of output.
TypeError: object of type 'int' has no len()
Is there anything I am doing wrong?