
arfx


arfx is a family of command-line tools for copying sampled data in and out of ARF containers. ARF (https://github.com/melizalab/arf) is an open, portable file format for storing behavioral and neural data, based on HDF5.

installation

pip install arfx

or from source:

python setup.py install

usage

The general syntax is arfx operation [options] files, similar to tar. Operations are as follows:

  • -A: copy data from one container to another
  • -c: create a new container
  • -r: append data to the container
  • -t: list contents of the container
  • -x: extract entries from the container
  • -d: delete entries from the container

Options specify the target ARF file, verbosity, automatic naming schemes, and any metadata to be stored in the entry; examples follow the list below.

  • -f FILE: use ARF file FILE
  • -v: verbose output
  • -n NAME: name entries sequentially, using NAME as the base
  • -a ANIMAL: specify the animal
  • -e EXPERIMENTER: specify the experimenter
  • -p PROTOCOL: specify the protocol
  • -s HZ: specify the sampling rate of the data, in Hz
  • -T DATATYPE: specify the type of data
  • -u: do not compress data in the arf file
  • -P: when deleting entries, do not repack
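
For example, the following invocations (file and entry names are placeholders) create a container from two wave files, list its contents, and extract a single entry:

  arfx -c -f mydata.arf trial_001.wav trial_002.wav
  arfx -t -f mydata.arf
  arfx -x -f mydata.arf trial_001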

input files

arfx can read sampled data from pcm, wave, npy, and mda files. Support for additional file formats can be added as plugins (see "extending arfx" below).

When adding data to an ARF container (-c and -r modes), the input files are specified on the command line and added in the order given. By default, each entry is given the same name as its input file, minus the extension; if an input file contains more than one entry, each entry also receives a numerical index. To override this, use the -n flag to specify a base name; all entries are then named sequentially from that base.

The -n, -a, -e, -p, -s, and -T options store information about the data being added to the file. The DATATYPE argument can be given as a numerical or enumeration code (run arfx --help-datatypes for a list) and indicates the type of data in the entries. All of the entries created in a single run of arfx are given these values. The -u option tells arfx not to compress the data, which can speed up I/O operations slightly.
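
For instance, a session of wave files might be imported with metadata attached to every entry (the file names and metadata values here are illustrative):

  arfx -c -f st11_session1.arf -n trial -a st11 -e smm -p chorus -s 20000 -T EXTRAC_HP *.wav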

Currently only one sampled dataset per entry is supported. Clearly this does not encompass many use cases, but arfx is intended as a simple tool. More specialized import procedures can be easily written in Python using the arf library.
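
As a starting point, a custom import might look like the following minimal sketch, which assumes the arf library's open_file, create_entry, and create_dataset functions; consult the arf documentation for the exact signatures and required attributes:

  import datetime
  import numpy as np
  import arf

  # open (or create) an ARF container and add one entry holding one sampled dataset
  with arf.open_file("custom_import.arf", "a") as fp:
      entry = arf.create_entry(fp, "trial_001", datetime.datetime.now(), animal="st11")
      data = np.zeros(20000, dtype="int16")   # placeholder for real samples
      arf.create_dataset(entry, "pcm_000", data, units="", sampling_rate=20000)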

output files

The entries to be extracted (in -x mode) can be specified by name. If no names are specified, all the entries are extracted. All sampled datasets in each entry are extracted as separate channels, because they may have different sampling rates. Event datasets are not extracted.

By default the output files will be in wave format and will have names of the form entry_channel.wav. The -n argument can be used to customize the names and file format of the output files. The argument must be a template in the format defined by the Python string module. Supported field names include entry, channel, and index, as well as the names of any HDF5 attributes stored on the entry or channel. The extension of the output template determines the file format. Currently only wave is supported, but additional formats may be supplied as plugins (see "extending arfx" below).
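
For example, extraction might be invoked with a template like the one below; the placeholder syntax shown is string.Template-style ($field), as the reference to the Python string module suggests, so check arfx --help if your version expects a different form:

  arfx -x -f mydata.arf -n '${entry}_${channel}.wav'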

The metadata options are ignored when extracting files; any metadata present in the ARF container that is also supported by the target container is copied.

other operations

As with tar, the -t operation will list the contents of the archive. Each entry/channel is listed on a separate line in path notation.

The -A flag is used to copy the contents of one ARF file to another. The entries are copied without modification from the source ARF file(s) to the target container.
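
For example, to merge two recordings into a single container (file names are placeholders):

  arfx -A -f combined.arf session1.arf session2.arf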

The -d (delete) operation uses the same syntax as the extract operation, but instead of extracting the entries, they are deleted. Because of limitations in the underlying HDF5 library, this does not free up the space, so the file is repacked unless the -P option is set.
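
For example (the entry name is a placeholder):

  arfx -d -f mydata.arf trial_001        # delete the entry and repack the file
  arfx -d -P -f mydata.arf trial_001     # delete without repacking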

The -U (update) operation can be used to add or update attributes of entries, and to rename entries (if the -n flag is set).

The --write-attr operation can be used to store the contents of text files in top-level attributes. The attributes have the name user_<filename>. The --read-attr operation can be used to read out those attributes. This is useful when data collection programs generate log or settings files that you want to store in the ARF file.
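
A plausible invocation, assuming the text files are given positionally as with other operations (check arfx --help for the exact argument order):

  arfx --write-attr -f mydata.arf settings.txt    # stored as the attribute user_settings.txt
  arfx --read-attr -f mydata.arf settings.txt     # print the stored contents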

other utilities

This package includes a few additional scripts for more specialized operations.

arfx-split

This script is used to reorganize very large recordings, possibly contained in multiple files, into manageable chunks. Each new entry is given an updated timestamp and attributes from the source entries. Currently, no effort is made to splice data across entries or files. This may result in some short entries. Only sampled datasets are processed.

arfx-collect-sampled

This script is used to export data into a flat binary structure. It collects sampled data across channels and entries into a single 2-D array. The output can be stored in a multichannel wav file or in a raw binary dat format (N samples by M channels), which is used by a wide variety of spike-sorting tools.
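
Because the raw binary format is just samples interleaved across channels, it can be read back with numpy once the channel count and sample type are known; both values below are assumptions that must match how the data were exported:

  import numpy as np

  n_channels = 64                                               # assumed channel count
  data = np.memmap("recording.dat", dtype="int16", mode="r")    # assumed sample dtype
  data = data.reshape(-1, n_channels)                           # N samples x M channels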

extending arfx

Additional formats for reading and writing can be added using the Python setuptools plugin system. Plugins must be registered in the arfx.io entry point group, with a name corresponding to the extension of the file format handled by the plugin.
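
For example, a hypothetical package providing a reader/writer for a ".foo" format could register its plugin class in setup.py like this (package, module, and class names are placeholders):

  # setup.py for a hypothetical arfx IO plugin package
  from setuptools import setup

  setup(
      name="arfx-foo",
      py_modules=["fooio"],
      entry_points={
          "arfx.io": ["foo = fooio:FooIO"],
      },
  )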

An arfx IO plugin is a class with the following required methods:

__init__(path, mode, **attributes): Opens the file at path. The mode argument specifies whether the file is opened for reading (r), writing (w), or appending (a). Must raise an IOError if the file does not exist or cannot be created, and a ValueError if the specified value for mode is not supported. The additional attributes keyword arguments specify metadata to be stored in the file when it is created. arfx will pass all attributes of the channel and entry (e.g., channels, sampling_rate, units, and datatype) when opening a file for writing. This method may raise a ValueError if the caller fails to set a required attribute or attempts to set an attribute inconsistent with the data format. Unsupported attributes should be ignored.

read(): Reads the contents of the opened file and returns the data in a format suitable for storage in an ARF file. Specifically, it must be an acceptable type for the arf.entry.add_data() method (see https://github.com/melizalab/arf for documentation).

write(data): Writes data to the file. Must raise an IOError if the file is opened in the wrong mode, and a TypeError if the data format is not correct for the file format.

timestamp: A readable property giving the time point of the data. The value may be a scalar indicating the number of seconds since the epoch, or a two-element sequence giving the number of seconds and microseconds since the epoch. If this property is writable it will be set by arfx when writing data.

sampling_rate: A property indicating the sampling rate of the data in the file (or current entry), in units of Hz.

The class may also define the following methods and properties. If any property is not defined, it is assumed to have the default value defined below.

nentries: A readable property indicating the number of entries in the file. Default value is 1.

entry: A readable and writable integer-valued property corresponding to the index of the currently active entry in the file. Active means that the read() and write() methods will affect only that entry. Default is 0, and arfx will not attempt to change the property if nentries is 1.
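
Putting the pieces together, a minimal skeleton for a hypothetical single-entry ".foo" format might look like the following sketch; the class name and file parsing are purely illustrative, and a real plugin would implement its own format logic:

  import numpy as np

  class FooIO:
      """Hypothetical arfx IO plugin for a '.foo' file holding one sampled dataset."""

      def __init__(self, path, mode="r", **attributes):
          if mode not in ("r", "w"):
              raise ValueError("unsupported mode: %r" % mode)
          self.path = path
          self.mode = mode
          # attributes passed by arfx (e.g. sampling_rate, datatype); unknown ones are ignored
          self._sampling_rate = attributes.get("sampling_rate", 20000)
          self._timestamp = attributes.get("timestamp", 0.0)
          if mode == "r":
              # parsing of the real format would go here; a missing file raises an IOError
              self._data = np.fromfile(path, dtype="int16")

      def read(self):
          return self._data

      def write(self, data):
          if self.mode != "w":
              raise IOError("file not opened for writing")
          np.asarray(data).tofile(self.path)

      @property
      def timestamp(self):
          return self._timestamp

      @property
      def sampling_rate(self):
          return self._sampling_rate

      @property
      def nentries(self):
          return 1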

version information

arfx uses semantic versioning and is synchronized with the major/minor version numbers of the arf package specification.


arfx's Issues

module: downsampler

Need a module to downsample data in order to efficiently process EEG data.

modify toolchain dsl to support more sophisticated filters

Currently you can only apply filters in series (i.e., the chunk has to match all the predicates), but it would be nice to specify more sophisticated filters. It may not be worth the effort, though, and almost certainly will break backwards compatibility without a lot of bending over backwards...

remove arfx-oephys dependency on timestamps.npy?

It looks like kilosort or phy is deleting this file, which causes the arfx-oephys script to throw an error. This only comes up if the person doing the analysis tries to sort spikes before doing the arf conversion, but that's impossible to enforce, and it's often desirable to do the spike sorting first to make sure that there are decent units in the recording.

The script only uses the first sample in timestamps.npy, so if it's possible to convert the data without needing the file, that would eliminate this issue entirely.

Export all sampled datasets to a single multichannel raw file

This kind of output format is needed for spike sorting with spyking circus, klusta, etc.

Doesn't quite fit into the standard extract operation mode, because the data will be collated across entries. It's more similar to the copy operation, except that the target file is not an arf file.

deprecate arfxplog

I don't think anyone collects data using saber; stop supporting after 2.5

module: data viewer

It would be useful to have a consumer module for paging through data. However, this may not really fit into the pipeline model all that well.

Cannot convert to arf file without timestamps.npy file

The folder is called C194_2023-10-16_16-30-54_chorus on grafisia

arfx-oephys -T EXTRAC_HP -k experimenter=smm3rc -k bird=C194 -k pen=1 -k site=1 -k protocol=chorus -f C194_1_1.arf C194_2023-10-16_16-30-54_chorus/

[arfx-oephys] Opened 'C194_1_1.arf' for writing
[arfx-oephys] Reading from 'C194_2023-10-16_16-30-54_chorus/Record Node 104/experiment1/recording1':
[arfx-oephys] - warning: timestamps.npy file is missing for Rhythm_FPGA-100.0; falling back on sync_messages.txt
Traceback (most recent call last):
  File "/home/melizalab/.local/pipx/venvs/arfx/lib/python3.8/site-packages/arfx/oephys.py", line 48, in __init__
    timestamps = np.load(
  File "/home/melizalab/.local/pipx/venvs/arfx/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'C194_2023-10-16_16-30-54_chorus/Record Node 104/experiment1/recording1/continuous/Rhythm_FPGA-100.0/timestamps.npy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/melizalab/.local/bin/arfx-oephys", line 8, in <module>
    sys.exit(script())
  File "/home/melizalab/.local/pipx/venvs/arfx/lib/python3.8/site-packages/arfx/oephys.py", line 323, in script
    rec = recording(dir)
  File "/home/melizalab/.local/pipx/venvs/arfx/lib/python3.8/site-packages/arfx/oephys.py", line 171, in __init__
    dset = continuous_dset(path, processor)
  File "/home/melizalab/.local/pipx/venvs/arfx/lib/python3.8/site-packages/arfx/oephys.py", line 64, in __init__
    raise RuntimeError(
RuntimeError: unable to determine sync time for Rhythm_FPGA-100.0 dataset

module: median rescaler

Rescaling data by the median and median absolute deviation is less sensitive to outliers and transients.

TypeError when extracting all entries

I get this error when trying to extract all entries from an arf file:

Traceback (most recent call last):
  File "//anaconda/bin/arfx", line 11, in <module>
    sys.exit(arfx())
  File "//anaconda/lib/python3.4/site-packages/arfx/arfx.py", line 578, in arfx
    args.op(args.arffile, entries, **opts)
  File "//anaconda/lib/python3.4/site-packages/arfx/arfx.py", line 287, in extract_entries
    dset = entry[channel]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir--------/h5py/_objects.c:2582)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir--------/h5py/_objects.c:2541)
  File "//anaconda/lib/python3.4/site-packages/h5py/_hl/dataset.py", line 462, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "//anaconda/lib/python3.4/site-packages/h5py/_hl/selections.py", line 88, in select
    sel[args]
  File "//anaconda/lib/python3.4/site-packages/h5py/_hl/selections.py", line 356, in __getitem__
    if sorted(arg) != list(arg):
TypeError: unorderable types: numpy.ndarray() > str()

Running Python 3.4, arfx 2.2.3, and h5py 2.6.0

arfx upgrade fails on datasets with ndims > 2

This problem affects files that have been upgraded to the 1.1 spec and then subsequently to the 2.0 spec. The upgrade may create datasets with rank 2, but with the second dimension size 1. For example:

d.shape = (60000,1)
d.maxshape = (60000, None)

These datasets aren't chunked properly and can't be repacked by h5repack.

The solution is to fix the 1.1 to 2.0 upgrade function to check for this condition and copy the data into a new dataset with the correct layout. The <1.1 function should also probably be fixed to avoid creating this problem.

module: highpass/lowpass filter

This is needed to split broadband data into EEG and spike data. Should use Peter Malonis's filter code from JILL to ensure we're using the same algorithms.

split jobs across processors

The pipeline-based model is highly amenable to parallelization. Instead of directly passing chunks between processing modules, they're placed into queues and then assigned to workers. Simple in principle but potentially a lot of work.

Option to restrict channels in arfx-oephys

Add a flag to take in a file with a list of channels to include, excluding all others. Useful when (for example) recording a 64-channel electrode on a 128-channel headstage, as there does not appear to be an easy way to leave the channels out of the recording.
