
spikewrap's Introduction

Warning Spikewrap is not sufficiently tested to be used in analysis. This release is only for testing. Do not use for your final analyses.

Warning Limitations

  • works only on SpikeGLX recordings with 1 gate, 1 trigger, 1 probe (per run, e.g. g0, t0, imec0)
  • requires standard input folder format
  • only run one subject / run at a time
  • has limited preprocessing options (tshift, bandpass_filter, common median reference)
  • no options to remove potentially large intermediate files
  • installation / running on the HPC is a bit clunky. In future this can be simplified by organising SLURM jobs under the hood and setting up an HPC module.
  • untested!
  • The documentation is currently outdated.

Features

  • preprocess SpikeGLX data (tshift, bandpass_filter, common median reference)
  • spike sorting (kilosort2, kilosort2_5, kilosort3)
  • quality check measures on the sorting results

Local Installation

Sorting requires an NVIDIA GPU and so is currently only available using the SWC's High-Performance Computer (HPC). However, local installation is useful for visualising the preprocessing steps prior to running the full pipeline (see 'Visualisation' below).

To install locally, clone the repository to your local machine using git.

git clone git@github.com:neuroinformatics-unit/spikewrap.git

Change directory to the repo and install using

pip install -e .

or, to also install developer dependencies

pip install -e .[dev]

or if using the zsh shell

pip install -e ".[dev]"

After installation, the module can be imported with import spikewrap.
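
A quick sanity check that the editable install is picking up the cloned repository (a minimal sketch):

import spikewrap

print(spikewrap.__file__)  # should resolve to a file inside the cloned repository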

Running on the HPC

Currently, sorting is required to run on the SWC HPC with access to /ceph/neuroinformatics.

To connect and run on the HPC (e.g. from Windows, macOS or Linux terminal):

ssh <username>@<swc-hpc-address>

ssh hpc-gw1

The first time, it is necessary to set up and install spikewrap. It is strongly recommended to make a new conda environment on the HPC before installing spikewrap.

module load miniconda

conda create --name spikewrap python=3.10

conda activate spikewrap

and install spikewrap and its dependencies:

mkdir ~/git-repos

cd ~/git-repos

git clone https://github.com/JoeZiminski/spikewrap.git

cd spikewrap

pip install -e .

Before running, it is necessary to request use of a GPU node on the HPC to run spike sorting with KiloSort. To run preprocessing and spike sorting, create a script using the API or call from the command line interface (instructions below).

srun -p gpu --gres=gpu:1 -n 8 --mem=40GB --pty bash -i

module load cuda

module load miniconda

conda activate spikewrap

python my_pipeline_script.py

Quick Start Guide

Spikewrap (currently) expects input data to be stored in a rawdata folder. Data for each subject (e.g. mouse) should be stored in the rawdata folder in SpikeGLX output format (example below). Currently, only recordings with 1 gate, 1 trigger and 1 probe are supported (i.e. index 0 for gate, trigger and probe: g0, t0 and imec0).

└── rawdata/
    └── 1110925/
        └── 1110925_test_shank1_g0/
            └── 1110925_test_shank1_g0_imec0/
                ├── 1110925_test_shank1_g0_t0.imec0.ap.bin
                └── 1110925_test_shank1_g0_t0.imec0.ap.meta

API (script)

Example code to analyse data in this format is below:

from spikewrap.pipeline.full_pipeline import run_full_pipeline

base_path = "/ceph/neuroinformatics/neuroinformatics/scratch/ece_ephys_learning"

if __name__ == "__main__":

    run_full_pipeline(
        base_path=base_path,
        sub_name="1110925",
        run_name="1110925_test_shank1",
        config_name="test",
        sorter="kilosort2_5",
    )

base_path is the path containing the required rawdata folder.

sub_name is the subject to run, and run_name is the SpikeGLX run name.

config_name is the name of the preprocessing / sorting settings to use (see below).

sorter is the name of the sorter to use (currently supported: kilosort2, kilosort2_5 and kilosort3).

Note run_full_pipeline must be run in the if __name__ == "__main__" block as it uses the multiprocessing module.

Output

Output of spike sorting will be in a derivatives folder at the same level as the rawdata. The subfolder organisation of derivatives will match rawdata.

Outputs are the saved preprocessed data, spike sorting results, and quality check measures. For example, the full output of a sorting run with the input data above is:

├── rawdata/
│   └── ...
└── derivatives/
    └── 1110925/
        └── 1110925_test_shank1_g0/
            └── 1110925_test_shank1_g0_imec0/
                ├── preprocessed/
                │   ├── data_class.pkl
                │   └── si_recording
                └── kilosort2_5-sorting/
                    ├── in_container_sorting/
                    ├── sorter_output/
                    ├── waveforms/
                    │   └── <spikeinterface waveforms output>
                    ├── quality_metrics.csv
                    ├── spikeinterface_log.json
                    ├── spikeinterface_params.json
                    └── spikeinterface_recording.json

preprocessed:

  • si_recording: binary-format spikeinterface recording from the final preprocessing step
  • data_class.pkl: for spikewrap internal use

<sorter>-sorting output (e.g. kilosort2_5-sorting; multiple sorters can be run):

  • in_container_sorting: stored options used to run the sorter

  • sorter_output: the full output of the sorter (e.g. kilosort .npy files)

  • waveforms: spikeinterface waveforms output containing AP waveforms for detected spikes

  • quality_metrics.csv: output of spikeinterface quality check measures
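
For example, the saved quality metrics can be inspected with pandas (a minimal sketch; the path follows the derivatives layout shown above):

import pandas as pd

metrics = pd.read_csv(
    "derivatives/1110925/1110925_test_shank1_g0/1110925_test_shank1_g0_imec0/"
    "kilosort2_5-sorting/quality_metrics.csv"
)
print(metrics.head())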

Set Preprocessing Options

Currently supported are multiplexing correction or tshift (termed phase_shift here), common median referencing (CMR) (termed common_reference here) and bandpass filtering (bandpass_filter). These options provide an interface to SpikeInterface preprocessing options; more will be added soon.

Preprocessing options are set in yaml configuration files stored in sbi_ephys/sbi_ephys/configs/. A default pipeline is stored in test.yaml.

Custom preprocessing configuration files may be used by passing the full path to the .yaml configuration file as the config_name argument. For example:

'preprocessing':
  '1':
  - phase_shift
  - {}
  '2':
  - bandpass_filter
  - freq_min: 300
    freq_max: 6000
  '3':
  - common_reference
  - operator: median
    reference: global

'sorting':
  'kilosort3':
    'car': False
    'freq_min': 300

Configuration files are structured as a dictionary whose keys indicate the order in which to run preprocessing. Each value holds a list in which the first element is the name of the preprocessing step to run, and the second element is a dictionary of kwargs passed to the spikeinterface function.
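
As an illustration of how this structure maps onto SpikeInterface calls, a minimal sketch (not spikewrap's internal code; the config filename and data path are illustrative):

import yaml
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre

pp_funcs = {
    "phase_shift": spre.phase_shift,
    "bandpass_filter": spre.bandpass_filter,
    "common_reference": spre.common_reference,
}

# Load the raw SpikeGLX recording (path and stream are illustrative).
recording = se.read_spikeglx(
    "rawdata/1110925/1110925_test_shank1_g0", stream_id="imec0.ap"
)

with open("my_config.yaml") as f:
    config = yaml.safe_load(f)

# Keys ("1", "2", ...) give the order; each value is [step_name, kwargs].
for key in sorted(config["preprocessing"], key=int):
    step_name, kwargs = config["preprocessing"][key]
    recording = pp_funcs[step_name](recording, **kwargs)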

Visualise Preprocessing

Preprocessing output can be visualised locally to inspect the results of the preprocessing routines. To visualise preprocessing outputs:

import numpy as np

from spikewrap.pipeline.preprocess import preprocess
from spikewrap.pipeline.visualise import visualise

base_path = "/ceph/neuroinformatics/neuroinformatics/scratch/ece_ephys_learning"
sub_name = "1110925"
run_name = "1110925_test_shank1"

data = preprocess(base_path=base_path, sub_name=sub_name, run_name=run_name)

visualise(
    data,
    steps="all",
    mode="map",
    as_subplot=True,
    channel_idx_to_show=np.arange(10, 50),
    show_channel_ids=False,
    time_range=(1, 2),
)

This will display a plot showing data from all preprocessing steps, for channels with index 10-50, over the time period 1-2. Note this requires a GUI (i.e. it cannot be run in the HPC terminal) and is best run locally.

[plot: example preprocessing visualisation]


spikewrap's Issues

[Feature] Add `preprocessed` as remove intermediate file option

preprocessed was an option to delete_intermediate_files, but because the sorting_data class holds SpikeInterface recording objects that are not closed, an error occurs when trying to delete, e.g.:

[WinError 32] The process cannot access the file because it is being used by another process: 'C:\\data\\ephys\\test_data\\steve_multi_run\\1119617\\time-short\\derivatives\\1119617\\1119617_LSE1_shank12_g0\\preprocessed\\si_recording\\traces_cached_seg0.raw'

As a workaround, the hacky code below was used; however, it is proving unstable:

if "preprocessed" in delete_intermediate_files:
    # remove the existing link to the preprocessed data binary.
    # wait time of 5 s is arbitrary.
    # TODO: this feels very hacky. Can unlink the memory map from the segment?
    # Ask SI.
    del sorting_data.data
    gc.collect()
    time.sleep(5)

    if sorting_data.get_preprocessing_path(run_name).is_dir():
        shutil.rmtree(sorting_data.get_preprocessing_path(run_name))

The correct solution is to use SI to close the memmapped file link. However, I can't find this in their API - make an issue.

profile GPU usage

it would be worth testing these workflows across different GPUs to see if we can specifically recommend a lower-spec one.

You can specify the GPU type with e.g. srun -p gpu --gres=gpu:rtx2080:1

Based on the hardware guide, it should be okay to use older ones:

GPU: only Nvidia GPUs are supported. Of the currently available GTX or RTX gpus, anything starting at GTX 1660 should work fine, but the speed will typically be proportional to the cost, except for the 2080Ti, which is not 2x faster than 2080 despite being 2x more expensive. The sweet spot is probably the 2070. Do not consider the Titan RTX or Tesla GPUs: they are very expensive for no performance gains.

Expose HPC-related configs

Some configurations are currently hard-coded for the SWC HPC but should in theory be easy to edit for other systems. For example, the path where sorters are stored on the HPC is hard-coded, but others setting this up on their HPC will want something similar, just at a different path. Expose all such settings in configs and document them well.

better error message when preprocessing already exists

The option use_existing_preprocessing can be used to stop spikeinterface throwing an error if preprocessing already exists. Currently the exception is an SI exception, but it would be better to handle it ourselves and print that the use_existing_preprocessing option should be used.
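
A minimal sketch of what this could look like (variable names are hypothetical, not current spikewrap code):

if preprocessed_path.is_dir() and not use_existing_preprocessing:
    raise RuntimeError(
        f"Preprocessed output already exists at {preprocessed_path}. "
        f"Set use_existing_preprocessing=True to reuse it, or delete the folder to re-run."
    )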

Properly handle splitting by group.

Currently it is not possible to preprocess or sort by channel group (e.g. shank). It would be nice to expose this critical SI functionality.

preprocessing was added in #151

Expose waveform saving options

Add option to configure waveform extraction in configs. e.g:

Extracts waveform on paired Recording-Sorting objects.
Waveforms can be persistent on disk (mode="folder") or in-memory (mode="memory"). By default, waveforms are extracted on a subset of the spikes (max_spikes_per_unit)
and on all channels (dense). If the sparse parameter is set to True, a sparsity is
estimated using a small number of spikes (num_spikes_for_sparsity) and waveforms are
extracted and saved in sparse mode.
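
A hedged sketch of the underlying SpikeInterface call these configs would wrap, assuming recording and sorting objects are already loaded (parameter values are illustrative; check the installed SpikeInterface version for the exact signature):

import spikeinterface.full as si

waveforms = si.extract_waveforms(
    recording,
    sorting,
    folder="waveforms",
    mode="folder",            # or "memory" for in-memory waveforms
    max_spikes_per_unit=500,  # subset of spikes extracted per unit
    sparse=True,              # estimate sparsity from a small number of spikes
    num_spikes_for_sparsity=100,
)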

[BUG] inter_sample_shift cannot always be found for phase shift

For some reason, the spikeinterface-generated inter_sample_shift property is not always found on SpikeGLX files. This leads phase-shifting to fail with the error

Traceback (most recent call last):
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/examples/./example_full_pipeline.py", line 18, in <module>
    run_full_pipeline(
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/full_pipeline.py", line 93, in run_full_pipeline
    data = preprocess(data, pp_steps, verbose)
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/preprocess.py", line 55, in preprocess
    perform_preprocessing_step(
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/preprocess.py", line 179, in perform_preprocessing_step
    data[new_name] = pp_funcs[pp_name](last_pp_step_output, **pp_options)
  File "/nfs/nhome/live/jziminski/.conda/envs/swc_ephys/lib/python3.10/site-packages/spikeinterface/core/core_tools.py", line 28, in reader_func
    return source_class(*args, **kwargs)
  File "/nfs/nhome/live/jziminski/.conda/envs/swc_ephys/lib/python3.10/site-packages/spikeinterface/preprocessing/phase_shift.py", line 42, in __init__
    assert "inter_sample_shift" in recording.get_property_keys(), "'inter_sample_shift' is not a property!"
AssertionError: 'inter_sample_shift' is not a property!

For example this is currently occurring with run_full_pipeline.py:

from pathlib import Path

base_path = Path(
    r"/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/test_data/steve_multi_run/1119617"
    r"/time-mid"
)
sub_name = "1119617"  # "cut_same_name"
run_names = ["1119617_LSE1_shank12",
             "1119617_posttest1_shank12",
             "1119617_pretest1_shank12"]

config_name = "test"
sorter = "kilosort2_5"

Consider how to handle CatGT Multiple gates / triggers

At the moment multiple SpikeGLX gates / triggers are not supported because a) they are not widely used in the building and b) they require additional support in spikeinterface. However, it would be nice to support them at some stage.

Some open questions

  • should the derivatives output have the gate idx?

[BUG] on windows need pip install pypiwin32

  1. on windows need pip install pypiwin32 for cuda
  2. now running into a pwd error (pwd is a Linux-only module, stack trace below)
Starting kilosort2_5 sorting...
Traceback (most recent call last):
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\examples\example_full_pipeline.py", line 13, in <module>
    run_full_pipeline(
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\pipeline\full_pipeline.py", line 60, in run_full_pipeline
    run_sorting(data, sorter, sorter_options, use_existing_preprocessed_file, verbose=verbose)
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\pipeline\sort.py", line 67, in run_sorting
    ss.run_sorter(
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 138, in run_sorter
    return run_sorter_container(
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 476, in run_sorter_container
    container_client = ContainerClient(mode, container_image, volumes, py_user_base_unix, extra_kwargs)
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 262, in __init__
    from spython.main import Client
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\__init__.py", line 77, in <module>
    Client = get_client()
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\__init__.py", line 19, in get_client
    from .base import Client as client
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\base\__init__.py", line 9, in <module>
    from spython.utils import check_install, get_singularity_version
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\utils\__init__.py", line 3, in <module>
    from .terminal import (
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\utils\terminal.py", line 12, in <module>
    import pwd
ModuleNotFoundError: No module named 'pwd'

profile speed

preprocessing will run slower than CatGT (pure C++); see how much slower for CatGT-supported functions (tshift, CAR, filtering). SI pay a lot of attention to optimisation, so it will probably be easier to see what they have than to do it ourselves.

expose `export_to_phy`

This is a really nice feature of SI; we should do it by default for non-kilosort sorters. The only question is how best to expose the configs.

[BUG] Handle local docker image

Currently these are pulled to the cwd. It would be nicer to save them to a swc_ephys folder in the home directory, but I am having trouble passing a pre-existing folder to the si singularity_image argument. Might need to handle this explicitly outside of si.

Things to Document

  • ~ syntax for home on linux not working / probably won't be supported
  • make note that preprocessing on multi-run calls is all per-segment
  • currently only spikeglx supported
  • preprocessing is lazy
  • Current configs are configured so kilosort does not apply preprocessing that we have applied already. This should be well documented for those building their own pipelines
  • Prior to release, make default configs clear, well documented and canonical.
  • Make notes on all quality metrics.
  • Document all exposed options, as well as 'how-to' on choosing every exposed option (e.g. sparsity)

Remove initialising empty `Path`

BTW, for some time I initialised Path class attributes with an empty Path() to keep the type checker happy, then updated them with the required path later. However, I didn't realise that this literally instantiates the current directory (Path('.')), and if the class variable is unexpectedly not updated you can do something like shutil.rmtree(my_class.some_path_attribute), which evaluates as shutil.rmtree(Path('.')) (I did, luckily without too much damage).

self.my_path_attr = Path()  # keeps the type checker happy
self.my_path_attr: Optional[Path] = None  # safe and explicit but a pain for typing
self.my_path_attr: Path  # easiest for typing but breaks the convention to initialise everything in `__init__()`

In the end I went for the third option to avoid littering asserts all over the code. But I will re-open this if it causes issues.

add conda management to checks module

Need to a) check the conda environment is called "spikewrap"; if not, the slurm argument must be used.
b) could add a convenience function to set up the environment. But maybe this is overkill.
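
A minimal sketch of check a), using the standard CONDA_DEFAULT_ENV environment variable (not current spikewrap code):

import os

if os.environ.get("CONDA_DEFAULT_ENV") != "spikewrap":
    raise RuntimeError(
        "spikewrap expects to run in a conda environment named 'spikewrap'; "
        "otherwise the slurm argument must be used."
    )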

[Feature] Add `split_by_group`, `highpass_spatial_filter`, `silence_periods`, `detect_bad_channels`

These are useful preprocessing steps it would be nice to expose, but they require some care.

Highpass Spatial Filter

This requires the recording to be split-by-group (see below).

Split by Group

This is nice as it allows preprocessing to be performed per-group, i.e. per-shank. However, I am not currently sure whether all preprocessing should be performed split by group, or how to handle sorting that is split by group. Need to ask on Slack. Handling group splitting is straightforward (it returns a dict of recording objects) thanks to SI's nice API.
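
A hedged sketch of the SI API in question, assuming a recording is already loaded (the bandpass filter step is just an example):

import spikeinterface.preprocessing as spre

# split_by returns a dict mapping each group (e.g. shank) to a sub-recording
recordings_by_group = recording.split_by("group")

preprocessed_by_group = {
    group: spre.bandpass_filter(sub_recording, freq_min=300, freq_max=6000)
    for group, sub_recording in recordings_by_group.items()
}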

silence_periods

This led to some problems with other preprocessing steps, is not a key step, and it is difficult to get the syntax for the arguments (a nested list) correct in .yaml, so leaving this for now.

detect_bad_channels

This is sensitive to the preprocessing steps; SI usually run it after phase shift and filtering, but before CAR. There is a decision to be made as to whether detect_bad_channels is always run after some fixed preprocessing steps (e.g. filtering), or can be dynamically run at some point in a user preprocessing chain depending on input arguments.

[BUG] Handle bad channels

Currently bad channels can be detected if selected as part of the pipeline, but it might be nice to always detect these and give a warning that they exist. This could prompt people to exclude them. Something like:

import spikeinterface.preprocessing as spre

# Note: `Data` and `utils` are spikewrap-internal; their imports are omitted here.


def handle_bad_channels(data: Data):
    """
    Placeholder function to begin handling bad channel detection. Even if not
    requested, it will always be useful to highlight the bad channels.

    However, it is not clear whether to print these / log these / provide a simple
    API argument to remove bad channels.

    TODO
    ----
    Need to determine when best to run this detection. Definitely after
    filtering; need to ask on Slack.
    see https://spikeinterface.readthedocs.io/en/latest/api.html
    https://github.com/int-brain-lab/ibl-neuropixel/blob/d913ede52117bc79d \
    e77f8dc9cdb407807ab8a8c/src/neurodsp/voltage.py#L445
    """
    bad_channels = spre.detect_bad_channels(data["0-raw"])

    utils.message_user(
        f"\nThe following channels were detected as dead / noise: {bad_channels[0]}\n"
        f"TODO: DO SOMETHING BETTER WITH THIS INFORMATION. SAVE IT SOMEWHERE\n"
        f"You may like to automatically remove bad channels "
        f"by setting [TO IMPLEMENT] as a preprocessing option\n"
        f"TODO: check how this is handled in SI\n"
    )

handle cuda loading

Currently it seems module load cuda needs to be run to load the CUDA drivers prior to spike sorting. Added this to the docs, but it would be nicer to handle this with an internal system call.

Supporting more sorters

Some initial problems:

  • mountainsort4 is hanging (singularity, HPC, no output), unsure why.
  • spykingcircus2 has extra requirements that should be included in the swc_ephys pyproject.toml.

Test TODOs [Feature]

in visualise.py

            # TODO: it is critical this run name is tested extremely
            # thoroughly. We don't want run IDS swapping!
            plot_title = (
                r"$\bf{Run \ name:}$" + f"{data.all_run_names[run_number - 1]}"
                "\n" + r"$\bf{Preprocessing \ step:}$" + f"{full_key}"
            )  # TODO: move to utils, same as below
            # newline must come at start to save space in case not all used

Remove similarity matrices, images

These are useful but need further development and are not supported by SI. For now, keep this strictly as an SI wrapper and develop additional postprocessing through SI / with further consultation with researchers.

[BUG] Understand sorting_without_excess_spikes

Understand what sorting_without_excess_spikes = load_sorting_output(data, recording, sorter) is doing in quality.py.

SI throws an error if it is not used and says that it must be used. It works, but I have no idea what it is doing or why the error is raised in the first place.
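
For reference, a hedged guess at the underlying operation, assuming the installed SpikeInterface version exposes curation.remove_excess_spikes (this may or may not be what load_sorting_output wraps):

from spikeinterface import curation

sorting_without_excess_spikes = curation.remove_excess_spikes(sorting, recording)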

Automate installation and environment setup

UPDATE: this is stupid; of course it is not possible to use spikewrap to install or set up spikewrap, as spikewrap runs IN conda. This will not be possible until we look at a fully packaged release, which is quite some time away.

UPDATE: cuda driver management is too complex for this stage of the project.

  • automate calls such as module load cuda
  • cuda driver install can be a pain. brainglobe has nice docs on this, but can it be handled internally?

Validate inputs and streamline API

Currently there is little input argument validation, so some errors at the SI level do not make sense (e.g. a wrong stream_id error when there is a typo in the path). a) check paths exist, etc.
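
A minimal sketch of such a check (a hypothetical helper, not current spikewrap API; it only covers the folder layout described in the Quick Start Guide):

from pathlib import Path

def validate_inputs(base_path, sub_name, run_name):
    """Fail early with a clear message instead of an opaque SI-level error."""
    rawdata_path = Path(base_path) / "rawdata"
    if not rawdata_path.is_dir():
        raise FileNotFoundError(f"No 'rawdata' folder found at {rawdata_path}")

    run_path = rawdata_path / sub_name / f"{run_name}_g0" / f"{run_name}_g0_imec0"
    if not run_path.is_dir():
        raise FileNotFoundError(f"Expected SpikeGLX run folder not found: {run_path}")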

[BUG] clean up intermediate files

Need to clean up all intermediate files (e.g. waveforms, preprocessed outputs) by default. This is listed as a bug because it is critical prior to deployment.

Expose more waveform options

  • Handle waveforms overwriting / loading existing waveforms.
  • Use SI sparse waveforms?
  • 'Postprocessing section' of postprocess.py main function - API confusing and messy here, refactor for consistency
  • check SI is always in uV for every probe - see postprocess.py

Testing

  • check results against existing pipelines from researchers
  • check SI, CatGT, IBL against each-other

Try `run_sorters` rather than `run_sorter`

run_sorters has the nice feature that you can parallelise the backend across CPUs or dask for free. This can be hidden from the user, and either sorter_name or [sorter_1, sorter_2, ...] is passed at a high level.

The only questions are 1) the difference in signature between run_sorter and run_sorters, and how the different kwargs are passed, and 2) how best to expose the API options (loop, cpu, dask).
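
A hedged sketch of what the run_sorters call could look like, assuming a recording is already loaded (argument names follow the SpikeInterface version this project was written against and may have changed since):

import spikeinterface.sorters as ss

sortings = ss.run_sorters(
    sorter_list=["kilosort2_5", "kilosort3"],
    recording_dict_or_list={"my_run": recording},
    working_folder="sorting_output",
    engine="loop",  # or "joblib" / "dask" to parallelise across sorters
)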

[BUG] Windows is not supported

Currently Windows is not supported (at least for sorting) because:

  • singularity is not supported for Windows. need to use docker
  • pypiwin32 is a dependency (as simple as adding it to the dependencies)

Extend preprocessing options

The set of available preprocessing options defined in the config .yaml files can be expanded to include all available options in SpikeInterface. This should be simple, except for bad channel detection / removal, which is sensitive to the preprocessing steps it is performed after and will require more care. Finally, this is a good time to check the problems caused by the most recent spikeinterface API changes.

This issue subsumes #33, #18

some TODOs

This was a messy list of TODOs which have now been separated out into different issues.

Add logging

Need to add logs that update at each stage, in particular when kilosort is running. This doesn't need to be super advanced at this stage, but it should be informative.
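
A minimal sketch of the kind of logging that could be added (logger name and messages are illustrative, not existing spikewrap code):

import logging

logger = logging.getLogger("spikewrap")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logger.info("Starting preprocessing for run %s", "1110925_test_shank1")
logger.info("Starting %s sorting, this may take a long time...", "kilosort2_5")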

[BUG] Adjust memory based on file size to sort

Recently a 300 GB kilosort 2.5 sort job had its process killed with 40 GB of memory. In future it would be good to manage this ourselves based on the size of the file to sort. Log all failed jobs here. Trying with 120 GB.

300 GB Kilosort 2 job. Fail: 40 GB.
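
A minimal sketch of a size-based heuristic (the scaling factor and minimum are placeholders to be calibrated against the failed jobs logged here, not tested values):

from pathlib import Path

def suggest_mem_gb(ap_bin_path, factor=0.4, minimum_gb=40):
    """Scale the SLURM memory request with the size of the .ap.bin file to sort."""
    size_gb = Path(ap_bin_path).stat().st_size / 1e9
    return max(minimum_gb, int(size_gb * factor))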

Consider renaming to `swc-ephys`

The smallest of small suggestions, but every other repo we have uses hyphens, and it would be great if these were consistent.
