
spikewrap's Introduction

Warning Spikewrap is not sufficiently tested to be used in analysis. This release is only for testing. Do not use for your final analyses.

Warning Limitations

  • works only on SpikeGLX recordings with 1 gate, 1 trigger, 1 probe (per run, e.g. g0, t0, imec0)
  • requires standard input folder format
  • only run one subject / run at a time
  • has limited preprocessing options (tshift, bandpass_filter, common median reference)
  • no options to remove potentially large intermediate files
  • installation / running on the HPC is a bit clunky. In future this can be simplified by organising SLURM jobs under the hood and setting up an HPC module.
  • untested!
  • The documentation is currently outdated.

Features

  • preprocess SpikeGLX data (tshift, bandpass_filter, common median reference)
  • spike sorting (kilosort2, kilosort2_5, kilosort3)
  • quality check measures on the sorting results

Local Installation

Sorting requires an NVIDIA GPU and so is currently only available using the SWC's High-Performance Computer (HPC). However, local installation is useful for visualising the preprocessing steps prior to running the full pipeline (see 'Visualisation' below).

To install locally, clone the repository to your local machine using git.

git clone git@github.com:neuroinformatics-unit/spikewrap.git

Change directory to the repo and install using

pip install -e .

or, to also install developer dependencies

pip install -e .[dev]

or if using the zsh shell

pip install -e ".[dev]"

After installation, the module can be imported with import spikewrap.
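
A quick sanity check that the editable install is picking up the cloned repository (a minimal sketch):

import spikewrap

print(spikewrap.__file__)  # should resolve to a file inside the cloned repository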

Running on the HPC

Currently, sorting is required to run on the SWC HPC with access to /ceph/neuroinformatics.

To connect and run on the HPC (e.g. from Windows, macOS or Linux terminal):

ssh <username>@<swc-hpc-address>

ssh hpc-gw1

The first time, it is necessary to set up and install spikewrap. It is strongly recommended to make a new conda environment on the HPC before installing spikewrap.

module load miniconda

conda create --name spikewrap python=3.10

conda activate spikewrap

and install spikewrap and its dependencies:

mkdir ~/git-repos

cd ~/git-repos

git clone https://github.com/JoeZiminski/spikewrap.git

cd spikewrap

pip install -e .

Before running, it is necessary to request use of a GPU node on the HPC to run spike sorting with KiloSort. To run preprocessing and spike sorting, create a script using the API or call from the command line interface (instructions below).

srun -p gpu --gres=gpu:1 -n 8 --mem=40GB --pty bash -i

module load cuda

module load miniconda

conda activate spikewrap

python my_pipeline_script.py

Quick Start Guide

Spikewrap (currently) expects input data to be stored in a rawdata folder. Data for each subject (e.g. mouse) should be stored in the rawdata folder in SpikeGLX output format (example below). Currently, only recordings with 1 gate, 1 trigger and 1 probe are supported (i.e. index 0 for gate, trigger and probe: g0, t0 and imec0).

└── rawdata/
    └── 1110925/
        └── 1110925_test_shank1_g0/
            └── 1110925_test_shank1_g0_imec0/
                ├── 1110925_test_shank1_g0_t0.imec0.ap.bin
                └── 1110925_test_shank1_g0_t0.imec0.ap.meta

API (script)

Example code to analyse data in this format is below:

from spikewrap.pipeline.full_pipeline import run_full_pipeline

base_path = "/ceph/neuroinformatics/neuroinformatics/scratch/ece_ephys_learning"

if __name__ == "__main__":

    run_full_pipeline(
        base_path=base_path,
        sub_name="1110925",
        run_name="1110925_test_shank1",
        config_name="test",
        sorter="kilosort2_5",
    )

base_path is the path containing the required rawdata folder.

sub_name is the subject to run, and run_name is the SpikeGLX run name.

config_name is the name of the preprocessing / sorting settings to use (see below).

sorter is the name of the sorter to use (currently supported: kilosort2, kilosort2_5 and kilosort3).

Note run_full_pipeline must be run in the if __name__ == "__main__" block as it uses the multiprocessing module.

Output

Output of spike sorting will be in a derivatives folder at the same level as the rawdata. The subfolder organisation of derivatives will match rawdata.

Outputs are the saved preprocessed data, spike sorting results, and quality check measures. For example, the full output of a sorting run with the input data above is:

├── rawdata/
│   └── ...
└── derivatives/
    └── 1110925/
        └── 1110925_test_shank1_g0/
            └── 1110925_test_shank1_g0_imec0/
                ├── preprocessed/
                │   ├── data_class.pkl
                │   └── si_recording
                └── kilosort2_5-sorting/
                    ├── in_container_sorting/
                    ├── sorter_output/
                    ├── waveforms/
                    │   └── <spikeinterface waveforms output>
                    ├── quality_metrics.csv
                    ├── spikeinterface_log.json
                    ├── spikeinterface_params.json
                    └── spikeinterface_recording.json

preprocessed:

  • si_recording: binary-format spikeinterface recording from the final preprocessing step
  • data_class.pkl: for spikewrap internal use

<sorter>-sorting output (e.g. kilosort2_5-sorting; multiple sorters can be run):

  • in_container_sorting: stored options used to run the sorter

  • sorter_output: the full output of the sorter (e.g. kilosort .npy files)

  • waveforms: spikeinterface waveforms output containing AP waveforms for detected spikes

  • quality_metrics.csv: output of spikeinterface quality check measures
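
For example, the saved quality metrics can be inspected with pandas (a minimal sketch; the path follows the derivatives layout shown above):

import pandas as pd

metrics = pd.read_csv(
    "derivatives/1110925/1110925_test_shank1_g0/1110925_test_shank1_g0_imec0/"
    "kilosort2_5-sorting/quality_metrics.csv"
)
print(metrics.head())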

Set Preprocessing Options

Currently supported are multiplexing correction or tshift (termed phase_shift here), common median referencing (CMR) (termed common_reference here) and bandpass filtering (bandpass_filter). These options provide an interface to SpikeInterface preprocessing options; more will be added soon.

Preprocessing options are set in yaml configuration files stored in sbi_ephys/sbi_ephys/configs/. A default pipeline is stored in test.yaml.

Custom preprocessing configuration files may be used by passing the full path to the .yaml configuration file as the config_name argument. For example:

'preprocessing':
  '1':
  - phase_shift
  - {}
  '2':
  - bandpass_filter
  - freq_min: 300
    freq_max: 6000
  '3':
  - common_reference
  - operator: median
    reference: global

'sorting':
  'kilosort3':
    'car': False
    'freq_min': 300

Configuration files are structured as a dictionary whose keys indicate the order in which to run preprocessing. Each value holds a list in which the first element is the name of the preprocessing step to run, and the second element is a dictionary of kwargs passed to the spikeinterface function.
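
As an illustration of how this structure maps onto SpikeInterface calls, a minimal sketch (not spikewrap's internal code; the config filename and data path are illustrative):

import yaml
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre

pp_funcs = {
    "phase_shift": spre.phase_shift,
    "bandpass_filter": spre.bandpass_filter,
    "common_reference": spre.common_reference,
}

# Load the raw SpikeGLX recording (path and stream are illustrative).
recording = se.read_spikeglx(
    "rawdata/1110925/1110925_test_shank1_g0", stream_id="imec0.ap"
)

with open("my_config.yaml") as f:
    config = yaml.safe_load(f)

# Keys ("1", "2", ...) give the order; each value is [step_name, kwargs].
for key in sorted(config["preprocessing"], key=int):
    step_name, kwargs = config["preprocessing"][key]
    recording = pp_funcs[step_name](recording, **kwargs)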

Visualise Preprocessing

Preprocessing output can be visualised locally to inspect the results of the preprocessing routines. To visualise preprocessing outputs:

import numpy as np

from spikewrap.pipeline.preprocess import preprocess
from spikewrap.pipeline.visualise import visualise

base_path = "/ceph/neuroinformatics/neuroinformatics/scratch/ece_ephys_learning"
sub_name = "1110925"
run_name = "1110925_test_shank1"

data = preprocess(base_path=base_path, sub_name=sub_name, run_name=run_name)

visualise(
    data,
    steps="all",
    mode="map",
    as_subplot=True,
    channel_idx_to_show=np.arange(10, 50),
    show_channel_ids=False,
    time_range=(1, 2),
)

This will display a plot showing data from all preprocessing steps, for channels with index 10-50, over the time period 1-2. Note this requires a GUI (i.e. it cannot be run in the HPC terminal) and is best run locally.

[plot: example preprocessing visualisation]


spikewrap's Issues

[Feature] Add `preprocessed` as remove intermediate file option

preprocessed was an option to delete_intermediate_files, but because the sorting_data class holds SpikeInterface recording objects that are not closed, an error occurs when trying to delete, e.g.:

[WinError 32] The process cannot access the file because it is being used by another process: 'C:\\data\\ephys\\test_data\\steve_multi_run\\1119617\\time-short\\derivatives\\1119617\\1119617_LSE1_shank12_g0\\preprocessed\\si_recording\\traces_cached_seg0.raw'

As a workaround, the hacky code below was used; however, it is proving unstable:

if "preprocessed" in delete_intermediate_files:
    # remove the existing link to the preprocessed data binary.
    # wait time of 5 s is arbitrary.
    # TODO: this feels very hacky. Can unlink the memory map from the segment?
    # Ask SI.
    del sorting_data.data
    gc.collect()
    time.sleep(5)

    if sorting_data.get_preprocessing_path(run_name).is_dir():
        shutil.rmtree(sorting_data.get_preprocessing_path(run_name))

The correct solution is to use SI to close the memmapped file link. However, I can't find this in their API - make an issue.

profile GPU usage

it would be worth testing these workflows across different GPUs to see if we can specifically recommend a lower-spec one.

You can specify the GPU type with e.g. srun -p gpu --gres=gpu:rtx2080:1

Based on the hardware guide, it should be okay to use older ones:

GPU: only Nvidia GPUs are supported. Of the currently available GTX or RTX gpus, anything starting at GTX 1660 should work fine, but the speed will typically be proportional to the cost, except for the 2080Ti, which is not 2x faster than 2080 despite being 2x more expensive. The sweet spot is probably the 2070. Do not consider the Titan RTX or Tesla GPUs: they are very expensive for no performance gains.

Expose HPC-related configs

Some configurations are currently hard-coded for the SWC HPC but should in theory be easy to edit for other systems. For example, the path where sorters are stored on the HPC is hard-coded, but others setting this up on their HPC will want something similar, just at a different path. Expose all such settings in configs and document them well.

better error message when preprocessing already exists

The option use_existing_preprocessing can be used to stop spikeinterface throwing an error if preprocessing already exists. Currently the exception is an SI exception, but it would be better to handle it ourselves and print that the use_existing_preprocessing option should be used.
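
A minimal sketch of what this could look like (variable names are hypothetical, not current spikewrap code):

if preprocessed_path.is_dir() and not use_existing_preprocessing:
    raise RuntimeError(
        f"Preprocessed output already exists at {preprocessed_path}. "
        f"Set use_existing_preprocessing=True to reuse it, or delete the folder to re-run."
    )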

Properly handle splitting by group.

Currently it is not possible to preprocess or sort by channel group (e.g. shank). It would be nice to expose this critical SI functionality.

preprocessing was added in #151

Expose waveform saving options

Add option to configure waveform extraction in configs. e.g:

Extracts waveform on paired Recording-Sorting objects.
Waveforms can be persistent on disk (mode="folder") or in-memory (mode="memory"). By default, waveforms are extracted on a subset of the spikes (max_spikes_per_unit)
and on all channels (dense). If the sparse parameter is set to True, a sparsity is
estimated using a small number of spikes (num_spikes_for_sparsity) and waveforms are
extracted and saved in sparse mode.
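
A hedged sketch of the underlying SpikeInterface call these configs would wrap, assuming recording and sorting objects are already loaded (parameter values are illustrative; check the installed SpikeInterface version for the exact signature):

import spikeinterface.full as si

waveforms = si.extract_waveforms(
    recording,
    sorting,
    folder="waveforms",
    mode="folder",            # or "memory" for in-memory waveforms
    max_spikes_per_unit=500,  # subset of spikes extracted per unit
    sparse=True,              # estimate sparsity from a small number of spikes
    num_spikes_for_sparsity=100,
)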

[BUG] inter_sample_shift cannot always be found for phase shift

For some reason, the spikeinterface-generated inter_sample_shift property is not always found on SpikeGLX files. This leads phase-shifting to fail with the error

Traceback (most recent call last):
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/examples/./example_full_pipeline.py", line 18, in <module>
    run_full_pipeline(
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/full_pipeline.py", line 93, in run_full_pipeline
    data = preprocess(data, pp_steps, verbose)
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/preprocess.py", line 55, in preprocess
    perform_preprocessing_step(
  File "/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/code/swc_ephys/swc_ephys/pipeline/preprocess.py", line 179, in perform_preprocessing_step
    data[new_name] = pp_funcs[pp_name](last_pp_step_output, **pp_options)
  File "/nfs/nhome/live/jziminski/.conda/envs/swc_ephys/lib/python3.10/site-packages/spikeinterface/core/core_tools.py", line 28, in reader_func
    return source_class(*args, **kwargs)
  File "/nfs/nhome/live/jziminski/.conda/envs/swc_ephys/lib/python3.10/site-packages/spikeinterface/preprocessing/phase_shift.py", line 42, in __init__
    assert "inter_sample_shift" in recording.get_property_keys(), "'inter_sample_shift' is not a property!"
AssertionError: 'inter_sample_shift' is not a property!

For example this is currently occurring with run_full_pipeline.py:

from pathlib import Path

base_path = Path(
    r"/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/test_data/steve_multi_run/1119617"
    r"/time-mid"
)
sub_name = "1119617"  # "cut_same_name"
run_names = ["1119617_LSE1_shank12",
             "1119617_posttest1_shank12",
             "1119617_pretest1_shank12"]

config_name = "test"
sorter = "kilosort2_5"

Consider how to handle CatGT Multiple gates / triggers

At the moment multiple SpikeGLX gates / triggers are not supported because a) they are not widely used in the building and b) they require additional support in spikeinterface. However, it would be nice to support them at some stage.

Some open questions

  • should the derivatives output have the gate idx?

[BUG] on windows need pip install pypiwin32

  1. on windows need pip install pypiwin32 for cuda
  2. now running into a pwd error (pwd is a Linux-only module, stack trace below)
Starting kilosort2_5 sorting...
Traceback (most recent call last):
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\examples\example_full_pipeline.py", line 13, in <module>
    run_full_pipeline(
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\pipeline\full_pipeline.py", line 60, in run_full_pipeline
    run_sorting(data, sorter, sorter_options, use_existing_preprocessed_file, verbose=verbose)
  File "C:\Users\Joe\work\git-repos\swc_ephys\swc_ephys\pipeline\sort.py", line 67, in run_sorting
    ss.run_sorter(
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 138, in run_sorter
    return run_sorter_container(
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 476, in run_sorter_container
    container_client = ContainerClient(mode, container_image, volumes, py_user_base_unix, extra_kwargs)
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spikeinterface\sorters\runsorter.py", line 262, in __init__
    from spython.main import Client
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\__init__.py", line 77, in <module>
    Client = get_client()
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\__init__.py", line 19, in get_client
    from .base import Client as client
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\main\base\__init__.py", line 9, in <module>
    from spython.utils import check_install, get_singularity_version
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\utils\__init__.py", line 3, in <module>
    from .terminal import (
  File "C:\Users\Joe\miniconda3\envs\swc_ephys\lib\site-packages\spython\utils\terminal.py", line 12, in <module>
    import pwd
ModuleNotFoundError: No module named 'pwd'

profile speed

preprocessing will run slower than CatGT (pure C++); see how much slower for CatGT-supported functions (tshift, CAR, filtering). SI pay a lot of attention to optimisation, so it will probably be easier to see what they have than to do it ourselves.

expose `export_to_phy`

This is a really nice feature of SI; we should do it by default for non-kilosort sorters. The only question is how best to expose the configs.

[BUG] Handle local docker image

Currently these are pulled to the cwd. It would be nicer to save them to a swc_ephys folder in the home directory, but I am having trouble passing a pre-existing folder to the si singularity_image argument. Might need to handle this explicitly outside of si.

Things to Document

  • ~ syntax for home on linux not working / probably won't be supported
  • make note that preprocessing on multi-run calls is all per-segment
  • currently only spikeglx supported
  • preprocessing is lazy
  • Current configs are configured so kilosort does not apply preprocessing that we have applied already. This should be well documented for those building their own pipelines
  • Prior to release, make default configs clear, well documented and canonical.
  • Make notes on all quality metrics.
  • Document all exposed options, as well as 'how-to' on choosing every exposed option (e.g. sparsity)

Remove initialising empty `Path`

BTW, for some time I initialised Path class attributes with an empty Path() to keep the type checker happy, then updated them with the required path later. However, I didn't realise that this literally instantiates the current directory (Path('.')), and if the class variable is unexpectedly not updated you can do something like shutil.rmtree(my_class.some_path_attribute), which evaluates as shutil.rmtree(Path('.')) (I did, luckily without too much damage).

self.my_path_attr = Path()  # keeps the type checker happy
self.my_path_attr: Optional[Path] = None  # safe and explicit but a pain for typing
self.my_path_attr: Path  # easiest for typing but breaks the convention to initialise everything in `__init__()`

In the end I went for the third option to avoid littering asserts all over the code. But I will re-open this if it causes issues.

add conda management to checks module

Need to a) check the conda environment is called "spikewrap"; if not, the slurm argument must be used.
b) could add a convenience function to set up the environment. But maybe this is overkill.
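
A minimal sketch of check a), using the standard CONDA_DEFAULT_ENV environment variable (not current spikewrap code):

import os

if os.environ.get("CONDA_DEFAULT_ENV") != "spikewrap":
    raise RuntimeError(
        "spikewrap expects to run in a conda environment named 'spikewrap'; "
        "otherwise the slurm argument must be used."
    )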

[Feature] Add `split_by_group`, `highpass_spatial_filter`, `silence_periods`, `detect_bad_channels`

These are useful preprocessing steps it would be nice to expose, but they require some care.

Highpass Spatial Filter

This requires the recording to be split-by-group (see below).

Split by Group

This is nice as it allows preprocessing to be performed per-group, i.e. per-shank. However, I am not currently sure whether all preprocessing should be performed split by group, or how to handle sorting that is split by group. Need to ask on Slack. Handling group splitting is straightforward (it returns a dict of recording objects) thanks to SI's nice API.
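
A hedged sketch of the SI API in question, assuming a recording is already loaded (the bandpass filter step is just an example):

import spikeinterface.preprocessing as spre

# split_by returns a dict mapping each group (e.g. shank) to a sub-recording
recordings_by_group = recording.split_by("group")

preprocessed_by_group = {
    group: spre.bandpass_filter(sub_recording, freq_min=300, freq_max=6000)
    for group, sub_recording in recordings_by_group.items()
}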

silence_periods

This led to some problems with other preprocessing steps, is not a key step, and it is difficult to get the syntax for the arguments (a nested list) correct in .yaml, so leaving this for now.

detect_bad_channels

This is sensitive to the preprocessing steps; SI usually run it after phase shift and filtering, but before CAR. There is a decision to be made as to whether detect_bad_channels is always run after some fixed preprocessing steps (e.g. filtering), or can be dynamically run at some point in a user preprocessing chain depending on input arguments.

[BUG] Handle bad channels

Currently bad channels can be detected if selected as part of the pipeline, but it might be nice to always detect these and give a warning that they exist. This could prompt people to exclude them. Something like:

import spikeinterface.preprocessing as spre

# Note: `Data` and `utils` are spikewrap-internal; their imports are omitted here.


def handle_bad_channels(data: Data):
    """
    Placeholder function to begin handling bad channel detection. Even if not
    requested, it will always be useful to highlight the bad channels.

    However, it is not clear whether to print these / log these / provide a simple
    API argument to remove bad channels.

    TODO
    ----
    Need to determine when best to run this detection. Definitely after
    filtering; need to ask on Slack.
    see https://spikeinterface.readthedocs.io/en/latest/api.html
    https://github.com/int-brain-lab/ibl-neuropixel/blob/d913ede52117bc79d \
    e77f8dc9cdb407807ab8a8c/src/neurodsp/voltage.py#L445
    """
    bad_channels = spre.detect_bad_channels(data["0-raw"])

    utils.message_user(
        f"\nThe following channels were detected as dead / noise: {bad_channels[0]}\n"
        f"TODO: DO SOMETHING BETTER WITH THIS INFORMATION. SAVE IT SOMEWHERE\n"
        f"You may like to automatically remove bad channels "
        f"by setting [TO IMPLEMENT] as a preprocessing option\n"
        f"TODO: check how this is handled in SI\n"
    )

handle cuda loading

Currently it seems module load cuda needs to be run to load the CUDA drivers prior to spike sorting. Added this to the docs, but it would be nicer to handle this with an internal system call.

Supporting more sorters

Some initial problems:

  • mountainsort4 is hanging (singularity, HPC, no output), unsure why.
  • spykingcircus2 has extra requirements that should be included in the swc_ephys pyproject.toml.

Test TODOs [Feature]

in visualise.py

            # TODO: it is critical this run name is tested extremely
            # thoroughly. We don't want run IDS swapping!
            plot_title = (
                r"$\bf{Run \ name:}$" + f"{data.all_run_names[run_number - 1]}"
                "\n" + r"$\bf{Preprocessing \ step:}$" + f"{full_key}"
            )  # TODO: move to utils, same as below
            # newline must come at start to save space in case not all used

Remove similarity matrices, images

These are useful but need further development and are not supported by SI. For now, keep this strictly as an SI wrapper and develop additional postprocessing through SI / with further consultation with researchers.

[BUG] Understand sorting_without_excess_spikes

Understand what sorting_without_excess_spikes = load_sorting_output(data, recording, sorter) is doing in quality.py.

SI throws an error if it is not used and says that it must be used. It works, but I have no idea what it is doing or why the error is raised in the first place.
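
For reference, a hedged guess at the underlying operation, assuming the installed SpikeInterface version exposes curation.remove_excess_spikes (this may or may not be what load_sorting_output wraps):

from spikeinterface import curation

sorting_without_excess_spikes = curation.remove_excess_spikes(sorting, recording)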

Automate installation and environment setup

UPDATE: this is stupid; of course it is not possible to use spikewrap to install or set up spikewrap, as spikewrap runs IN conda. This will not be possible until we look at a fully packaged release, which is quite some time away.

UPDATE: cuda driver management is too complex for this stage of the project.

  • automate calls such as module load cuda
  • cuda driver install can be a pain. brainglobe has nice docs on this, but can it be handled internally?

Validate inputs and streamline API

Currently there is little input argument validation, so some errors at the SI level do not make sense (e.g. a wrong stream_id error when there is a typo in the path). a) check paths exist, etc.
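
A minimal sketch of such a check (a hypothetical helper, not current spikewrap API; it only covers the folder layout described in the Quick Start Guide):

from pathlib import Path

def validate_inputs(base_path, sub_name, run_name):
    """Fail early with a clear message instead of an opaque SI-level error."""
    rawdata_path = Path(base_path) / "rawdata"
    if not rawdata_path.is_dir():
        raise FileNotFoundError(f"No 'rawdata' folder found at {rawdata_path}")

    run_path = rawdata_path / sub_name / f"{run_name}_g0" / f"{run_name}_g0_imec0"
    if not run_path.is_dir():
        raise FileNotFoundError(f"Expected SpikeGLX run folder not found: {run_path}")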

[BUG] clean up intermediate files

Need to clean up all intermediate files (e.g. waveforms, preprocessed outputs) by default. This is listed as a bug because it is critical prior to deployment.

Expose more waveform options

  • Handle waveforms overwriting / loading existing waveforms.
  • Use SI sparse waveforms?
  • 'Postprocessing section' of postprocess.py main function - API confusing and messy here, refactor for consistency
  • check SI is always in uV for every probe - see postprocess.py

Testing

  • check results against existing pipelines from researchers
  • check SI, CatGT, IBL against each-other

Try `run_sorters` rather than `run_sorter`

run_sorters has the nice feature that you can parallelise the backend across CPUs or dask for free. This can be hidden from the user, and either sorter_name or [sorter_1, sorter_2, ...] is passed at a high level.

The only questions are 1) the difference in signature between run_sorter and run_sorters, and how the different kwargs are passed, and 2) how best to expose the API options (loop, cpu, dask).
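
A hedged sketch of what the run_sorters call could look like, assuming a recording is already loaded (argument names follow the SpikeInterface version this project was written against and may have changed since):

import spikeinterface.sorters as ss

sortings = ss.run_sorters(
    sorter_list=["kilosort2_5", "kilosort3"],
    recording_dict_or_list={"my_run": recording},
    working_folder="sorting_output",
    engine="loop",  # or "joblib" / "dask" to parallelise across sorters
)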

[BUG] Windows is not supported

Currently Windows is not supported (at least for sorting) because:

  • singularity is not supported for Windows. need to use docker
  • pypiwin32 is a dependency (as simple as adding it to the dependencies)

Extend preprocessing options

The set of available preprocessing options defined in the config .yaml files can be expanded to include all available options in SpikeInterface. This should be simple, except for bad channel detection / removal, which is sensitive to the preprocessing steps it is performed after and will require more care. Finally, this is a good time to check the problems caused by the most recent spikeinterface API changes.

This issue subsumes #33, #18

some TODOs

This was a messy list of TODOs which have now been separated out into different issues.

Add logging

Need to add logs that update at each stage, in particular when kilosort is running. This doesn't need to be super advanced at this stage, but it should be informative.
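
A minimal sketch of the kind of logging that could be added (logger name and messages are illustrative, not existing spikewrap code):

import logging

logger = logging.getLogger("spikewrap")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logger.info("Starting preprocessing for run %s", "1110925_test_shank1")
logger.info("Starting %s sorting, this may take a long time...", "kilosort2_5")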

[BUG] Adjust memory based on file size to sort

Recently a 300 GB kilosort 2.5 sort job had its process killed with 40 GB of memory. In future it would be good to manage this ourselves based on the size of the file to sort. Log all failed jobs here. Trying with 120 GB.

300 GB Kilosort 2 job. Fail: 40 GB.
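
A minimal sketch of a size-based heuristic (the scaling factor and minimum are placeholders to be calibrated against the failed jobs logged here, not tested values):

from pathlib import Path

def suggest_mem_gb(ap_bin_path, factor=0.4, minimum_gb=40):
    """Scale the SLURM memory request with the size of the .ap.bin file to sort."""
    size_gb = Path(ap_bin_path).stat().st_size / 1e9
    return max(minimum_gb, int(size_gb * factor))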

Consider renaming to `swc-ephys`

The smallest of small suggestions, but every other repo we have uses hyphens, and it would be great if these were consistent.
