
hmp's Introduction


Welcome to my GitHub profile. My name is Gabriel Weindel (he/him) and I'm a cognitive scientist. I currently work as a postdoc both at the Bernoulli Institute for Mathematics and Computer Science of the University of Groningen and in the Experimental Psychology division of Utrecht University.

I am mainly interested in nerdy stuff about computational models of cognition (especially decision making), electrophysiology and Bayesian statistics.

To learn more, check out my website.

hmp's People

Contributors

dmkhitaryan, gweindel, jelmerborst, rickdott


hmp's Issues

Xlabel crashing plot_topomap function

hsmm.plot_LOOCV(loocv)

Yields

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-afb38d636416> in <module>
      3 #plt.xlabel('Number of bumps');
      4 
----> 5 hsmm.plot_LOOCV(loocv_covert)

D:\Documents\School\FYRP\eeg_analysis\pyhsmm_mvpa.py in plot_LOOCV(loocv_estimates, pval, figsize)
    313             ax[0].text(x=n_bump-.5, y=np.mean(loocv_estimates.sel(n_bump=n_bump).data), s=str(np.sum(diff_bin[-1]))+'/'+str(len(diffs[-1]))+':'+str(np.around(pvals[-1][-1],2)))
    314     ax[1].plot(diffs,'.-', alpha=.3)
--> 315     ax[1].set_xticks(ticks=np.arange(0,loocv_estimates.n_bump.max()-1), labels=labels)
    316     ax[1].hlines(0,0,len(np.arange(2,loocv_estimates.n_bump.max())),color='k')
    317     ax[1].set_ylabel('Change in likelihood')

D:\Anaconda\lib\site-packages\matplotlib\axes\_base.py in wrapper(self, *args, **kwargs)
     61 
     62         def wrapper(self, *args, **kwargs):
---> 63             return get_method(self)(*args, **kwargs)
     64 
     65         wrapper.__module__ = owner.__module__

D:\Anaconda\lib\site-packages\matplotlib\cbook\deprecation.py in wrapper(*args, **kwargs)
    449                 "parameter will become keyword-only %(removal)s.",
    450                 name=name, obj_type=f"parameter of {func.__name__}()")
--> 451         return func(*args, **kwargs)
    452 
    453     return wrapper

TypeError: set_ticks() got an unexpected keyword argument 'labels'
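A version-robust workaround is to split the call: the `labels` keyword was only added to `set_ticks()`/`set_xticks()` in Matplotlib 3.5, but calling `set_xticklabels()` separately works on both older and newer versions (a minimal sketch with toy tick values, not the plot_LOOCV code itself):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for headless use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ticks = np.arange(0, 5)
labels = [str(t + 2) for t in ticks]

# set_xticks(ticks, labels=labels) raises TypeError before Matplotlib 3.5;
# fixing the tick positions first and then setting labels is portable.
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
```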


Testing needed for the new ways of z-scoring

PR #54 introduced a new default way of z-scoring the data. The previous behavior of the hmp.transform_data() function was to z-score on a by-trial basis; the new default z-scores the data by participant, and a third option is to z-score all the data at once.

It is not yet clear which default option is best; further testing is needed to validate the chosen behavior.
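The three options can be sketched with plain NumPy (a toy illustration with made-up array shapes, not the actual hmp implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 2 participants x 3 trials x 100 samples (hypothetical shapes)
data = rng.normal(loc=5.0, scale=2.0, size=(2, 3, 100))

def zscore(x, axis):
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

per_trial = zscore(data, axis=-1)  # previous default: each trial separately
per_participant = zscore(data.reshape(2, -1), axis=-1).reshape(data.shape)  # new default
all_at_once = zscore(data.reshape(-1), axis=0).reshape(data.shape)  # third option
```

The three results differ in which mean and standard deviation are removed, which is exactly what the testing would need to compare downstream.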

Generating raw data and fitting from scratch

Directly generating raw EEG data (with known sources) will also be useful to assess upstream processing such as data reduction and pre-processing.
MNE has some examples, I will be starting from those.

New feature: hmp.visu.plot_components_sensor()

Introducing a new feature that allows plotting the channel contribution to each retained PC.

This is made possible by storing the PCA decomposition matrix as an attribute of the result of hmp.utils.transform_data:

hmp_data.attrs['components']
The new function then uses MNE's mne.viz.plot_topomap to plot the channel contributions, using the stored attribute and the same info or position object as hmp.visu.plot_topo_timecourse().
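The idea can be sketched with a plain SVD-based PCA on toy data (mne.viz.plot_topomap itself is left out so the sketch stays self-contained; shapes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_channels, n_pcs = 500, 32, 5
eeg = rng.normal(size=(n_samples, n_channels))

# PCA via SVD: the rows of Vt are the spatial patterns (channel weights) of each PC
centered = eeg - eeg.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:n_pcs]  # (n_pcs, n_channels): the kind of matrix that would
                         # be stored in hmp_data.attrs['components']

# each row could then be drawn with mne.viz.plot_topomap(components[k], info)
```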

image

Implementing a minimum number of iterations in the Expectation maximization algorithm.

Current behavior: the EM stops as soon as the likelihood from a new iteration is lower than the previous iteration (while lkh - lkh_prev > threshold):

        while lkh - lkh_prev > threshold and i < max_iteration:#Expectation-Maximization algorithm
            #As long as new run gives better likelihood, go on 

Problem: with bad starting points for $\theta$ and scarce data, it can happen that the while loop exits without having reached convergence (stabilization of the log-likelihood).

Solution: A minimum number of iterations fixes this behavior
Example with data from tutorial 2
image
Proposed implementation in branch #Improving-EM-estimation
When calling hmp.models.hmp() a minimum number of EM iterations can be specified.

        while i < max_iteration :#Expectation-Maximization algorithm
            if i > min_iteration and lkh - lkh_prev < threshold:
                break

The default is 10, and this can be overridden for specific fits when calling the EM() function if needed (e.g. the last step of the fit() function, where further iterations usually decrease the log-likelihood).
Additionally, every fitted object now has a new data variable, em_traces, that stores the value of the log-likelihood at each iteration. This can then be used to check convergence of the EM.

Resulting behavior:
The computing time of most functions remains the same for larger datasets, as those exceed the minimum number of iterations anyway; for the same reason the results also stay the same.
For smaller datasets, however, several methods including the fit() function and sliding_event() are now noticeably slower (but better), because each tested starting point is now likely to actually converge and hence takes longer.
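The effect of the minimum-iteration guard can be illustrated with a toy log-likelihood trace that dips early before converging (purely illustrative numbers; this is a skeleton of the stopping rule above, not the hmp EM itself):

```python
def run_em(lkh_sequence, threshold=1e-4, min_iteration=10):
    """Toy stopping rule: stop once the likelihood gain falls below
    `threshold`, but never before `min_iteration` iterations."""
    lkh_prev = float("-inf")
    traces = []  # analogous to the em_traces variable described above
    for i, lkh in enumerate(lkh_sequence):
        if i > min_iteration and lkh - lkh_prev < threshold:
            break
        traces.append(lkh)
        lkh_prev = lkh
    return traces

# a trace that dips at iteration 1 (bad starting point), then converges
seq = [0.0, -0.5, 0.5, 1.5, 2.0, 2.2, 2.29, 2.3, 2.3, 2.3, 2.3, 2.3, 2.3]
impatient = run_em(seq, min_iteration=0)   # breaks at the early dip
patient = run_em(seq, min_iteration=10)    # survives the dip and converges
```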

Low-pass filtering of the data is not necessary

The cross-correlation between the template and the data already acts as a low-pass filter, so applying one (as is currently the default) is unnecessary. The future default behavior will:

  • Change default behavior in read_mne_EEG
  • Adapt tutorial 1

Removing mp parameter in single_fit()

Behavior:
mp is a bool parameter of the models.single_fit() function that only transposes the magnitudes matrix when multiple models are estimated:

        if mp==True: #PCG: Dirty temporarilly needed for multiprocessing in the iterative backroll estimation...
            magnitudes = magnitudes.T

Solution
Remove mp, as it is hard to understand (e.g. #35), by appropriately feeding magnitudes to models.single_fit() when estimating multiple models.

fit() function fails to find events when they are neighboring a large event

In simulated data at least, generating events can lead to situations where events are missed.

#Simulation parameters (see tutorial 2 for complete cell)
# Randomly specify the transition events
n_events = 5
name_sources = np.random.choice(all_source_names,n_events+1, replace=False)#randomly pick source without replacement
# Generated sources = array(['middletemporal-lh', 'caudalmiddlefrontal-rh',
#'caudalanteriorcingulate-rh', 'lateraloccipital-lh',
# 'parstriangularis-lh', 'temporalpole-rh'], dtype='<U27')

times = np.random.uniform(25,300,n_events+1)/shape#randomly pick average times in millisecond between the events
#Generated : array([18.25129272, 18.15805496,  9.26716797, 16.61963181, 15.8271108 , 30.23900551])

When using the fit_single() function with the number of events generated, the correct solution is found without difficulty.

selected = init.fit_single(number_of_sources-1)#function to fit an instance of a 10 events model
hmp.visu.plot_topo_timecourse(eeg_dat, selected, positions, init, magnify=1, sensors=False,
                                times_to_display = np.mean(np.cumsum(random_source_times,axis=1),axis=0))

image

But running the fit() function fails to find the correct number of events despite a relatively high signal-to-noise ratio.

estimates, traces = init.fit(threshold=1, verbose=True, trace=True)
hmp.visu.plot_topo_timecourse(eeg_dat, estimates, positions, init, 
                                times_to_display = np.mean(np.cumsum(random_source_times,axis=1),axis=0))

image

Analyzing the trace of the fit() algorithm shows that all the samples between the first and last bump are modelled as an event similar to the first one.

hmp.visu.plot_iterations(traces, eeg_dat, init, positions, ['magnitudes_proposed','parameters_proposed'])

image
...
image
image

Looking at the sliding_event() function when given the true channel contributions shows that the third event is likely missed because its likelihood is too low compared to the others (to fix), but why the second one is missed is yet unknown.

init = hmp.models.hmp(hmp_dat, sfreq=eeg_dat.sfreq, event_width=50, cpus=cpus, em_method='max')#Initialization of the model
default_colors =  ['cornflowerblue','indianred','orange','darkblue','darkgreen','gold', 'brown']
_, ax = plt.subplots(figsize=(12,3), dpi=300)
init.sliding_event(show=False, ax=ax)
for event in range(n_events):
    times = [int(x) for x in init.starts+np.sum(random_source_times[:,:event+1], axis=1)+1]
    true_mags = np.median(init.events[times], axis=0)
    init.sliding_event(magnitudes=true_mags, color=default_colors[event], ax=ax)
plt.vlines(np.mean(np.cumsum(random_source_times,axis=1),axis=0), -250,600, colors=default_colors[:n_events])
plt.ylim(-250,800)

image

TQDM uses notebook context by default, should be adapted to the user's environment

The default behavior implemented in the fit() function of the hmp class yields the following warning when executed outside of a notebook context:

TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
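One possible fix (a sketch, assuming the tqdm dependency) is to import the bar from tqdm.auto, which only uses the notebook widget when it is actually available:

```python
# tqdm.auto picks the ipywidgets-based bar inside Jupyter and falls back to
# the plain text bar elsewhere, avoiding the IProgress warning.
try:
    from tqdm.auto import tqdm
except ImportError:  # keep the sketch runnable even without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable

total = 0
for i in tqdm(range(100), desc="fitting"):
    total += i
```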

Re-organizing the cross-correlated values is now done at the init phase instead of repeatedly on each EM() call

Current behavior:
At each call of EM(), the cross-correlated data (all_samples x n_dimensions) in self.events is reorganized into a by-trial matrix (max_samples x n_trials x n_dimensions).

Problem: given that the cross-correlated data does not change once hmp.models.hmp() is initialized, the most efficient approach is to do this only once, at init.

Consequence: the speed-up is very modest for a single call of EM(), but for functions making repeated calls (hmp.fit(), hmp.sliding_event(), hmp.backward_estimation()) the speed-up can be as high as 2x.

Possible problems:

  • If the cross-correlated data is changed after init by modifying self.events, the change will not be taken into account; modifying self.events after init is, however, not supported anyway.
  • The size of the initialized model increases, but this shouldn't be a problem given usual data sizes.

Will be shipped with issue #61

backward_estimation causes error when called

Using the function backward_estimation causes an error

KeyError: "not all values found in index 'event'"

Not sure yet where this is coming from, probably related to #38.

A short-term workaround is to decrease the number of events fitted with the max_bump parameter.

Inconsistencies in reconstruct function

To account for how MATLAB does the PCA, and therefore the reconstruction, I divide the PCs by the explained variance so that all PCs have the same variance. This might be good for the estimation (is it?), but it produces dubious results when reconstructing the EEG activity afterwards.
There must be a better way to do the PCA and the reconstruction.
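One option is to keep unit-variance PCs for estimation while storing the scaling factors, so the reconstruction can undo them exactly (a NumPy sketch on toy data, not the current hmp code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
Xc = X - X.mean(axis=0)

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ vt.T            # standard PC scores
scales = scores.std(axis=0)   # keep these to be able to invert the scaling
whitened = scores / scales    # every PC now has unit variance (MATLAB-style)

recon = (whitened * scales) @ vt  # undoing the scaling restores Xc exactly
```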

Durations and mean_d are based on max durations per subject instead of mean

At least when fitting multiple subjects, mean_d and durations of the hmp model are based on the max duration (i.e. the number of samples in the matrix) of each subject. They should be based on each subject's mean duration.

This might be caused by the fact that data.unstack() on line 60 of models.py gives the same number of epochs as there are subjects.

Moving data loading to dask to speed up and reduce RAM usage

Using dask, as integrated in xarray, by opening data with open_mfdataset dramatically speeds things up.

See:

    xr.open_dataset('../epoch_data_condition-wise_100Hz.nc').mean()
    CPU times: user 2.61 s, sys: 3.92 s, total: 6.53 s
    Wall time: 6.63 s

Compared to:

    xr.open_mfdataset('../epoch_data_condition-wise_100Hz.nc').mean()
    CPU times: user 4.82 ms, sys: 251 µs, total: 5.07 ms
    Wall time: 4.33 ms

Note that open_mfdataset is lazy: the millisecond timing above mostly reflects building the dask task graph, with the actual computation deferred until .compute() is called. This method also does not work with the current implementation of the computation tasks.

Extracting x,y electrode positions from MNE montage is not to be trusted for now

For now, regarding electrode positions, the best approach is to load raw or epoched EEG data in MNE and pass the raw.info to the function plot_topo_timecourse(), as extracting the x, y locations of electrodes from an MNE montage fails (see https://mne.discourse.group/t/problem-extracting-x-y-position-from-montage-to-use-with-plot-topomap/5849).

Channel locations extracted from EEGLAB seem to work, though (see tutorial 1, section 3).

Single core support for hsmm.backward_estimation()

Hey Gabriel,

Two reasons for this issue:

Reason 1:
In case we want to parallelize over subjects in the adjusted LOOCV as discussed in #34 we need support for:

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=1)
fit= init.backward_estimation(max_starting_points=1)

which would be called in the proposed __loocv_backwards() function but currently results in a ValueError: For loop not yet written use cpus >1 (which I think should actually be ValueError: For loop not yet written use cpus < 2 :-)). I implemented this in caa5930 if you want to take a look.

Reason 2:

Based on the implementation in caa5930, there appears to be no benefit of using multiple cores for this one on the data from tutorial 3. Calling:

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=4)
fit= init.backward_estimation(max_starting_points=1)

reports a wall time of 14.2 seconds on my machine while calling

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=1)
fit= init.backward_estimation(max_starting_points=1)

reports a wall time of 13.9 seconds. So the cost of spawning more processes appears to actually be higher than the gain of splitting the work. This will probably depend on a lot of factors (number of trials probably?) but I think it is another good reason why this should be added.

I did not open a PR though because there is this mp parameter in the fit_single() function that I do not fully understand. I assumed that I would have to set that one to False in my implementation but it only worked with it being set to True. I was not sure whether this is worth another issue, but maybe you can explain this parameter to me. :-)

Iterative fit fails for the expected max number of bumps

with multiprocessing.Pool(processes=cpus) as pool:
    results = pool.map(init.fit_single, np.arange(1,9))

yields

/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:283: RuntimeWarning: divide by zero encountered in log
  #normalization [-1, 1] divide each trial and state by the sum of the n points in a trial
/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:285: RuntimeWarning: invalid value encountered in true_divide
  
/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:170: RuntimeWarning: invalid value encountered in double_scalars
  magnitudes1 = np.copy(magnitudes)

then

File ~/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:194, in __fit()
    192                 parameters[i,:] = parameters1[i,:]
    193         lkh, eventprobs = self.calc_EEG_50h(magnitudes,parameters,n_bumps)
--> 194 return lkh1,magnitudes1,parameters1,eventprobs1

UnboundLocalError: local variable 'magnitudes1' referenced before assignment

Using hsmm.backward_estimation() as starting point for LOOCV might be biased

Possible problem:
The estimates from hsmm.backward_estimation() are based on the data from all subjects and thus contain information about the data that is purposefully omitted during LOOCV (i.e., the held-out data from a single subject). Since the estimate recovered by the EM (also used during the LOOCV) depends on the starting value, the estimate might end up biased towards the solution that worked for all subjects.

Proposed solution:
Move the backward estimation into the LOOCV procedure and calculate a backward estimate for each fold (i.e. the remaining data after excluding one subject). Then use those for the remainder of the LOOCV procedure.
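The proposed per-fold loop can be sketched as follows (backward_estimate is a hypothetical stand-in for the actual hsmm function; the participant labels are toy values):

```python
participants = ["S1", "S2", "S3", "S4"]

def backward_estimate(train):
    # hypothetical stand-in: fit starting points on the training subjects only
    return "fit(" + ",".join(train) + ")"

starting_points = {}
for held_out in participants:
    train = [p for p in participants if p != held_out]
    # the starting point for this fold never sees the held-out subject
    starting_points[held_out] = backward_estimate(train)
```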

This does appear to affect the results from tutorial 3:

Here the original approach:
loocv using original approach

And here the proposed alternative:
loocv using proposed solution

Selecting conditions without string matching

Likely a corner case, but given that subsetting the data to a condition is done by string matching, several conditions might be picked up if they share the string given as a test (e.g. both dependent and independent match dependent).
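A minimal illustration of the collision, and an exact-match alternative (toy condition names, not the actual hmp API):

```python
conditions = ["dependent", "independent", "neutral"]

# substring matching (the current behavior) picks up both conditions
substring_hits = [c for c in conditions if "dependent" in c]

# exact matching avoids the collision
exact_hits = [c for c in conditions if c == "dependent"]
```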

Contribution of the channels to the transition events (magnitudes) is now computed from the maximum probability in each trial

Up to now, the contribution of each channel to a transition event was computed as follows:

for comp in range(self.n_dims):
    magnitudes[event,comp] = np.mean(np.sum( \
        eventprobs[:,:,event]*self.data_matrix[:,:,comp], axis=0))
# Scale cross-correlation with likelihood of the transition
# sum by-trial these scaled activation for each transition events
# average across trials

But averaging over the probabilities in this way:

  1. slows down the estimation process, as the mean position is more influenced by the gamma distribution
  2. forces us to create a new data representation (see issue #62)
  3. based on simulations, seems to actually misestimate the event timings (see last point)

A new proposal is to base the computation of the contributions on the maximum probabilities:

if self.em_method == "max":
    #Take time point at maximum p() for each trial
    #Average channel activity at those points
    event_values = np.zeros((self.n_trials, self.n_dims))
    for trial in range(self.n_trials):
        time = self.starts[trial]+ np.argmax(eventprobs[:, trial, event])
        event_values[trial] = self.events[time]
    magnitudes[event] = np.mean(event_values, axis=0)

The latest commits in branch #Improving-EM-estimation use the max method by default, as it seems best both in terms of correctness and speed:

image
image
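The two ways of computing magnitudes can be contrasted on toy arrays (the shapes follow the snippets above, but the data itself is random and purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
max_samples, n_trials, n_dims = 50, 20, 4

# by-trial event probabilities (each trial's column sums to 1) and channel data
eventprobs = rng.dirichlet(np.ones(max_samples), size=n_trials).T  # (samples, trials)
data_matrix = rng.normal(size=(max_samples, n_trials, n_dims))

# mean method: probability-weighted sum over samples, then averaged across trials
mean_mag = np.mean(np.sum(eventprobs[:, :, None] * data_matrix, axis=0), axis=0)

# max method: channel activity at the single most probable sample of each trial
peaks = np.argmax(eventprobs, axis=0)                       # (n_trials,)
max_mag = np.mean(data_matrix[peaks, np.arange(n_trials)], axis=0)
```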

Random generation of starting points yields RuntimeWarning because of starting points larger than the mean RT

The title is self-explanatory. For now this is dealt with by ignoring the warning, as the error is not too bad. It does, however, suggest that the random generation of starting points is still inefficient.

Ways to reproduce

init = hmp.models.hmp(hmp_data, sfreq=eeg_data.sfreq, bump_width=50, cpus=cpus)#Initialization of the model
selected = init.fit_single(number_of_sources-1, method='random', starting_points=10)#function to fit an instance of a 4 bumps model

Issue during transformation of data for HMP with 2+ participants

Error: running the hmp.utils.transform_data method for more than one subject requires apply_standard=True. This argument, however, leads to an error when combined with eeg_data.data.

Solution: in this case, for now, run the method on eeg_data (i.e., the whole xarray dataset) instead of just eeg_data.data (the data variable).

Threshold argument in utils.bootstrapping

In utils.bootstrapping, the threshold argument passed through the function call is ignored, since on line 807 it is forcibly set to 1 regardless, as shown in the image below (the same applies to cpus):
image
