
hmp's Introduction


Welcome to my GitHub profile. My name is Gabriel Weindel (he/him) and I'm a cognitive scientist. I currently work as a postdoc both at the Bernoulli Institute for Mathematics and Computer Science of the University of Groningen and in the Experimental Psychology division of Utrecht University.

I am mainly interested in nerdy stuff about computational models of cognition (especially decision making), electrophysiology and Bayesian statistics.

To learn more, check out my website.

hmp's People

Contributors

dmkhitaryan, gweindel, jelmerborst, rickdott


hmp's Issues

Xlabel crashing plot_topomap function

hsmm.plot_LOOCV(loocv)

Yields

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-afb38d636416> in <module>
      3 #plt.xlabel('Number of bumps');
      4 
----> 5 hsmm.plot_LOOCV(loocv_covert)

D:\Documents\School\FYRP\eeg_analysis\pyhsmm_mvpa.py in plot_LOOCV(loocv_estimates, pval, figsize)
    313             ax[0].text(x=n_bump-.5, y=np.mean(loocv_estimates.sel(n_bump=n_bump).data), s=str(np.sum(diff_bin[-1]))+'/'+str(len(diffs[-1]))+':'+str(np.around(pvals[-1][-1],2)))
    314     ax[1].plot(diffs,'.-', alpha=.3)
--> 315     ax[1].set_xticks(ticks=np.arange(0,loocv_estimates.n_bump.max()-1), labels=labels)
    316     ax[1].hlines(0,0,len(np.arange(2,loocv_estimates.n_bump.max())),color='k')
    317     ax[1].set_ylabel('Change in likelihood')

D:\Anaconda\lib\site-packages\matplotlib\axes\_base.py in wrapper(self, *args, **kwargs)
     61 
     62         def wrapper(self, *args, **kwargs):
---> 63             return get_method(self)(*args, **kwargs)
     64 
     65         wrapper.__module__ = owner.__module__

D:\Anaconda\lib\site-packages\matplotlib\cbook\deprecation.py in wrapper(*args, **kwargs)
    449                 "parameter will become keyword-only %(removal)s.",
    450                 name=name, obj_type=f"parameter of {func.__name__}()")
--> 451         return func(*args, **kwargs)
    452 
    453     return wrapper

TypeError: set_ticks() got an unexpected keyword argument 'labels'
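A version-robust workaround is to split the call: the `labels` keyword was only added to `set_ticks()`/`set_xticks()` in Matplotlib 3.5, but calling `set_xticklabels()` separately works on both older and newer versions (a minimal sketch with toy tick values, not the plot_LOOCV code itself):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for headless use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ticks = np.arange(0, 5)
labels = [str(t + 2) for t in ticks]

# set_xticks(ticks, labels=labels) raises TypeError before Matplotlib 3.5;
# fixing the tick positions first and then setting labels is portable.
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
```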


Testing needed for the new ways of z-scoring

PR #54 introduced a new default way of z-scoring the data. The previous behavior of the hmp.transform_data() function was to z-score on a by-trial basis; the new default z-scores the data by participant, and a third option is to z-score all the data at once.

It is not yet clear which default option is best; further testing is needed to validate the chosen behavior.
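The three options can be sketched with plain NumPy (a toy illustration with made-up array shapes, not the actual hmp implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 2 participants x 3 trials x 100 samples (hypothetical shapes)
data = rng.normal(loc=5.0, scale=2.0, size=(2, 3, 100))

def zscore(x, axis):
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

per_trial = zscore(data, axis=-1)  # previous default: each trial separately
per_participant = zscore(data.reshape(2, -1), axis=-1).reshape(data.shape)  # new default
all_at_once = zscore(data.reshape(-1), axis=0).reshape(data.shape)  # third option
```

The three results differ in which mean and standard deviation are removed, which is exactly what the testing would need to compare downstream.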

Generating raw data and fitting from scratch

Directly generating raw EEG data (with known sources) will also be useful to assess upstream processing such as data reduction and pre-processing.
MNE has some examples, I will be starting from those.

New feature: hmp.visu.plot_components_sensor()

Introducing a new feature that allows plotting the channel contribution to each retained PC.

This is made possible by storing the PCA decomposition matrix as an attribute of the result of hmp.utils.transform_data:

hmp_data.attrs['components']
The new function then uses MNE's mne.viz.plot_topomap to plot the channel contributions, using the stored attribute and the same info or position object as hmp.visu.plot_topo_timecourse().
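The idea can be sketched with a plain SVD-based PCA on toy data (mne.viz.plot_topomap itself is left out so the sketch stays self-contained; shapes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_channels, n_pcs = 500, 32, 5
eeg = rng.normal(size=(n_samples, n_channels))

# PCA via SVD: the rows of Vt are the spatial patterns (channel weights) of each PC
centered = eeg - eeg.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:n_pcs]  # (n_pcs, n_channels): the kind of matrix that would
                         # be stored in hmp_data.attrs['components']

# each row could then be drawn with mne.viz.plot_topomap(components[k], info)
```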

image

Implementing a minimum number of iterations in the Expectation maximization algorithm.

Current behavior: the EM stops as soon as the likelihood from a new iteration is lower than the previous iteration (while lkh - lkh_prev > threshold):

        while lkh - lkh_prev > threshold and i < max_iteration:#Expectation-Maximization algorithm
            #As long as new run gives better likelihood, go on 

Problem: with bad starting points for $\theta$ and scarce data, it can happen that the while loop exits without having reached convergence (stabilization of the log-likelihood).

Solution: A minimum number of iterations fixes this behavior
Example with data from tutorial 2
image
Proposed implementation in branch #Improving-EM-estimation
When calling hmp.models.hmp() a minimum number of EM iterations can be specified.

        while i < max_iteration :#Expectation-Maximization algorithm
            if i > min_iteration and lkh - lkh_prev < threshold:
                break

The default is 10, and this can be overridden for specific fits when calling the EM() function if needed (e.g. the last step of the fit() function, where further iterations usually decrease the log-likelihood).
Additionally, every fitted object now has a new data variable, em_traces, that stores the value of the log-likelihood at each iteration. This can then be used to check convergence of the EM.

Resulting behavior:
The computing time of most functions remains the same for larger datasets, as those exceed the minimum number of iterations anyway; for the same reason the results also stay the same.
For smaller datasets, however, several methods including the fit() function and sliding_event() are now noticeably slower (but better), because each tested starting point is now likely to actually converge and hence takes longer.
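The effect of the minimum-iteration guard can be illustrated with a toy log-likelihood trace that dips early before converging (purely illustrative numbers; this is a skeleton of the stopping rule above, not the hmp EM itself):

```python
def run_em(lkh_sequence, threshold=1e-4, min_iteration=10):
    """Toy stopping rule: stop once the likelihood gain falls below
    `threshold`, but never before `min_iteration` iterations."""
    lkh_prev = float("-inf")
    traces = []  # analogous to the em_traces variable described above
    for i, lkh in enumerate(lkh_sequence):
        if i > min_iteration and lkh - lkh_prev < threshold:
            break
        traces.append(lkh)
        lkh_prev = lkh
    return traces

# a trace that dips at iteration 1 (bad starting point), then converges
seq = [0.0, -0.5, 0.5, 1.5, 2.0, 2.2, 2.29, 2.3, 2.3, 2.3, 2.3, 2.3, 2.3]
impatient = run_em(seq, min_iteration=0)   # breaks at the early dip
patient = run_em(seq, min_iteration=10)    # survives the dip and converges
```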

Low-pass filtering of the data is not necessary

The cross-correlation between the template and the data already acts as a low-pass filter, so applying one (as is currently the default) is unnecessary. The future default behavior will:

  • Change default behavior in read_mne_EEG
  • Adapt tutorial 1

Removing mp parameter in single_fit()

Behavior:
mp is a bool parameter of the models.single_fit() function that only transposes the magnitudes matrix when multiple models are estimated:

        if mp==True: #PCG: Dirty temporarilly needed for multiprocessing in the iterative backroll estimation...
            magnitudes = magnitudes.T

Solution
Remove mp, as it is hard to understand (e.g. #35), by appropriately feeding magnitudes to models.single_fit() when estimating multiple models.

fit() function fails to find events when they are neighboring a large event

In simulated data at least, generating events can lead to situations where events are missed.

#Simulation parameters (see tutorial 2 for complete cell)
# Randomly specify the transition events
n_events = 5
name_sources = np.random.choice(all_source_names,n_events+1, replace=False)#randomly pick source without replacement
# Generated sources = array(['middletemporal-lh', 'caudalmiddlefrontal-rh',
#'caudalanteriorcingulate-rh', 'lateraloccipital-lh',
# 'parstriangularis-lh', 'temporalpole-rh'], dtype='<U27')

times = np.random.uniform(25,300,n_events+1)/shape#randomly pick average times in millisecond between the events
#Generated : array([18.25129272, 18.15805496,  9.26716797, 16.61963181, 15.8271108 , 30.23900551])

When using the fit_single() function with the number of events generated, the correct solution is found without difficulty.

selected = init.fit_single(number_of_sources-1)#function to fit an instance of a 10 events model
hmp.visu.plot_topo_timecourse(eeg_dat, selected, positions, init, magnify=1, sensors=False,
                                times_to_display = np.mean(np.cumsum(random_source_times,axis=1),axis=0))

image

But running the fit() function fails to find the correct number of events despite a relatively high signal-to-noise ratio.

estimates, traces = init.fit(threshold=1, verbose=True, trace=True)
hmp.visu.plot_topo_timecourse(eeg_dat, estimates, positions, init, 
                                times_to_display = np.mean(np.cumsum(random_source_times,axis=1),axis=0))

image

Analyzing the trace of the fit() algorithm shows that all the samples between the first and last bump are modelled as an event similar to the first one.

hmp.visu.plot_iterations(traces, eeg_dat, init, positions, ['magnitudes_proposed','parameters_proposed'])

image
...
image
image

Looking at the sliding_event() function when given the true channel contributions shows that the third event is likely missed because its likelihood is too low compared to the others (to fix), but why the second one is missed is yet unknown.

init = hmp.models.hmp(hmp_dat, sfreq=eeg_dat.sfreq, event_width=50, cpus=cpus, em_method='max')#Initialization of the model
default_colors =  ['cornflowerblue','indianred','orange','darkblue','darkgreen','gold', 'brown']
_, ax = plt.subplots(figsize=(12,3), dpi=300)
init.sliding_event(show=False, ax=ax)
for event in range(n_events):
    times = [int(x) for x in init.starts+np.sum(random_source_times[:,:event+1], axis=1)+1]
    true_mags = np.median(init.events[times], axis=0)
    init.sliding_event(magnitudes=true_mags, color=default_colors[event], ax=ax)
plt.vlines(np.mean(np.cumsum(random_source_times,axis=1),axis=0), -250,600, colors=default_colors[:n_events])
plt.ylim(-250,800)

image

TQDM uses notebook context by default, should be adapted to the user's environment

The default behavior implemented in the fit() function of the hmp class yields the following warning when executed outside of a notebook context:

TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
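One possible fix (a sketch, assuming the tqdm dependency) is to import the bar from tqdm.auto, which only uses the notebook widget when it is actually available:

```python
# tqdm.auto picks the ipywidgets-based bar inside Jupyter and falls back to
# the plain text bar elsewhere, avoiding the IProgress warning.
try:
    from tqdm.auto import tqdm
except ImportError:  # keep the sketch runnable even without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable

total = 0
for i in tqdm(range(100), desc="fitting"):
    total += i
```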

Re-organizing the cross-correlated values is now done at the init phase instead of repeatedly on each EM() call

Current behavior:
At each call of EM(), the cross-correlated data (all_samples x n_dimensions) in self.events is reorganized into a by-trial matrix (max_samples x n_trials x n_dimensions).

Problem: given that the cross-correlated data does not change once hmp.models.hmp() is initialized, the most efficient approach is to do this only once, at init.

Consequence: the speed-up is very modest for a single call of EM(), but for functions making repeated calls (hmp.fit(), hmp.sliding_event(), hmp.backward_estimation()) the speed-up can be as high as 2x.

Possible problems:

  • If the cross-correlated data is changed after init by modifying self.events, the change will not be taken into account; modifying self.events after init is, however, not supported anyway.
  • The size of the initialized model increases, but this shouldn't be a problem given usual data sizes.

Will be shipped with issue #61

backward_estimation causes error when called

Using the function backward_estimation causes an error

KeyError: "not all values found in index 'event'"

Not sure yet where this is coming from, probably related to #38.

A short-term workaround is to decrease the number of events fitted with the max_bump parameter.

Inconsistencies in reconstruct function

To account for how MATLAB does the PCA, and therefore the reconstruction, I divide the PCs by the explained variance so that all PCs have the same variance. This might be good for the estimation (is it?), but it produces dubious results when reconstructing the EEG activity afterwards.
There must be a better way to do the PCA and the reconstruction.
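One option is to keep unit-variance PCs for estimation while storing the scaling factors, so the reconstruction can undo them exactly (a NumPy sketch on toy data, not the current hmp code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
Xc = X - X.mean(axis=0)

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ vt.T            # standard PC scores
scales = scores.std(axis=0)   # keep these to be able to invert the scaling
whitened = scores / scales    # every PC now has unit variance (MATLAB-style)

recon = (whitened * scales) @ vt  # undoing the scaling restores Xc exactly
```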

Durations and mean_d are based on max durations per subject instead of mean

At least when fitting multiple subjects, mean_d and durations of the hmp model are based on the max duration (i.e. the number of samples in the matrix) of each subject. They should be based on each subject's mean duration.

This might be caused by the fact that data.unstack() on line 60 of models.py gives the same number of epochs as there are subjects.

Moving data loading to dask to speed up and reduce RAM usage

Using dask, as integrated in xarray, by opening data with open_mfdataset dramatically speeds things up.

See:

    xr.open_dataset('../epoch_data_condition-wise_100Hz.nc').mean()
    CPU times: user 2.61 s, sys: 3.92 s, total: 6.53 s
    Wall time: 6.63 s

Compared to:

    xr.open_mfdataset('../epoch_data_condition-wise_100Hz.nc').mean()
    CPU times: user 4.82 ms, sys: 251 µs, total: 5.07 ms
    Wall time: 4.33 ms

Note that open_mfdataset is lazy: the millisecond timing above mostly reflects building the dask task graph, with the actual computation deferred until .compute() is called. This method also does not work with the current implementation of the computation tasks.

Extracting x,y electrode positions from MNE montage is not to be trusted for now

For now, regarding electrode positions, the best approach is to load raw or epoched EEG data in MNE and pass the raw.info to the function plot_topo_timecourse(), as extracting the x, y locations of electrodes from an MNE montage fails (see https://mne.discourse.group/t/problem-extracting-x-y-position-from-montage-to-use-with-plot-topomap/5849).

Channel locations extracted from EEGLAB seem to work, though (see tutorial 1, section 3).

Single core support for hsmm.backward_estimation()

Hey Gabriel,

Two reasons for this issue:

Reason 1:
In case we want to parallelize over subjects in the adjusted LOOCV as discussed in #34 we need support for:

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=1)
fit= init.backward_estimation(max_starting_points=1)

which would be called in the proposed __loocv_backwards() function but currently results in a ValueError: For loop not yet written use cpus >1 (which I think should actually be ValueError: For loop not yet written use cpus < 2 :-)). I implemented this in caa5930 if you want to take a look.

Reason 2:

Based on the implementation in caa5930, there appears to be no benefit of using multiple cores for this one on the data from tutorial 3. Calling:

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=4)
fit= init.backward_estimation(max_starting_points=1)

reports a wall time of 14.2 seconds on my machine while calling

init = hsmm.models.hsmm(hsmm_data, sf=eeg_data.sfreq, bump_width=50, cpus=1)
fit= init.backward_estimation(max_starting_points=1)

reports a wall time of 13.9 seconds. So the cost of spawning more processes appears to actually be higher than the gain of splitting the work. This will probably depend on a lot of factors (number of trials probably?) but I think it is another good reason why this should be added.

I did not open a PR though because there is this mp parameter in the fit_single() function that I do not fully understand. I assumed that I would have to set that one to False in my implementation but it only worked with it being set to True. I was not sure whether this is worth another issue, but maybe you can explain this parameter to me. :-)

Iterative fit fails for the expected max number of bumps

with multiprocessing.Pool(processes=cpus) as pool:
    results = pool.map(init.fit_single, np.arange(1,9))

yields

/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:283: RuntimeWarning: divide by zero encountered in log
  #normalization [-1, 1] divide each trial and state by the sum of the n points in a trial
/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:285: RuntimeWarning: invalid value encountered in true_divide
  
/home/gweindel/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:170: RuntimeWarning: invalid value encountered in double_scalars
  magnitudes1 = np.copy(magnitudes)

then

File ~/owncloud/projects/RUGUU/pyhsmm-mvpa/pyhsmm_mvpa.py:194, in __fit()
    192                 parameters[i,:] = parameters1[i,:]
    193         lkh, eventprobs = self.calc_EEG_50h(magnitudes,parameters,n_bumps)
--> 194 return lkh1,magnitudes1,parameters1,eventprobs1

UnboundLocalError: local variable 'magnitudes1' referenced before assignment

Using hsmm.backward_estimation() as starting point for LOOCV might be biased

Possible problem:
The estimates from hsmm.backward_estimation() are based on the data from all subjects and thus contain information about the data that is purposefully omitted during LOOCV (i.e., the held-out data from a single subject). Since the estimate recovered by the EM (also used during the LOOCV) depends on the starting value, the estimate might end up biased towards the solution that worked for all subjects.

Proposed solution:
Move the backward estimation into the LOOCV procedure and calculate a backward estimate for each fold (i.e. the remaining data after excluding one subject). Then use those for the remainder of the LOOCV procedure.
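The proposed per-fold loop can be sketched as follows (backward_estimate is a hypothetical stand-in for the actual hsmm function; the participant labels are toy values):

```python
participants = ["S1", "S2", "S3", "S4"]

def backward_estimate(train):
    # hypothetical stand-in: fit starting points on the training subjects only
    return "fit(" + ",".join(train) + ")"

starting_points = {}
for held_out in participants:
    train = [p for p in participants if p != held_out]
    # the starting point for this fold never sees the held-out subject
    starting_points[held_out] = backward_estimate(train)
```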

This does appear to affect the results from tutorial 3:

Here the original approach:
loocv using original approach

And here the proposed alternative:
loocv using proposed solution

Selecting conditions without string matching

Likely a corner case, but given that subsetting the data to a condition is done by string matching, several conditions might be picked up if they share the string given as a test (e.g. both dependent and independent match dependent).
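A minimal illustration of the collision, and an exact-match alternative (toy condition names, not the actual hmp API):

```python
conditions = ["dependent", "independent", "neutral"]

# substring matching (the current behavior) picks up both conditions
substring_hits = [c for c in conditions if "dependent" in c]

# exact matching avoids the collision
exact_hits = [c for c in conditions if c == "dependent"]
```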

Contribution of the channels to the transition events (magnitudes) is now computed from the maximum probability in each trial

Up to now, the contribution of each channel to a transition event was computed as follows:

for comp in range(self.n_dims):
    magnitudes[event,comp] = np.mean(np.sum( \
        eventprobs[:,:,event]*self.data_matrix[:,:,comp], axis=0))
# Scale cross-correlation with likelihood of the transition
# sum by-trial these scaled activation for each transition events
# average across trials

But averaging over the probabilities in this way:

  1. slows down the estimation process, as the mean position is more influenced by the gamma distribution
  2. forces us to create a new data representation (see issue #62)
  3. based on simulations, seems to actually misestimate the event timings (see last point)

A new proposal is to base the computation of the contributions on the maximum probabilities:

if self.em_method == "max":
    #Take time point at maximum p() for each trial
    #Average channel activity at those points
    event_values = np.zeros((self.n_trials, self.n_dims))
    for trial in range(self.n_trials):
        time = self.starts[trial]+ np.argmax(eventprobs[:, trial, event])
        event_values[trial] = self.events[time]
    magnitudes[event] = np.mean(event_values, axis=0)

The latest commits in branch #Improving-EM-estimation use the max method by default, as it seems best both in terms of correctness and speed:

image
image
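The two ways of computing magnitudes can be contrasted on toy arrays (the shapes follow the snippets above, but the data itself is random and purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
max_samples, n_trials, n_dims = 50, 20, 4

# by-trial event probabilities (each trial's column sums to 1) and channel data
eventprobs = rng.dirichlet(np.ones(max_samples), size=n_trials).T  # (samples, trials)
data_matrix = rng.normal(size=(max_samples, n_trials, n_dims))

# mean method: probability-weighted sum over samples, then averaged across trials
mean_mag = np.mean(np.sum(eventprobs[:, :, None] * data_matrix, axis=0), axis=0)

# max method: channel activity at the single most probable sample of each trial
peaks = np.argmax(eventprobs, axis=0)                       # (n_trials,)
max_mag = np.mean(data_matrix[peaks, np.arange(n_trials)], axis=0)
```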

Random generation of starting points yields RuntimeWarning because of starting points larger than the mean RT

The title is self-explanatory. For now this is dealt with by ignoring the warning, as the error is not too bad. It does, however, suggest that the random generation of starting points is still inefficient.

Ways to reproduce

init = hmp.models.hmp(hmp_data, sfreq=eeg_data.sfreq, bump_width=50, cpus=cpus)#Initialization of the model
selected = init.fit_single(number_of_sources-1, method='random', starting_points=10)#function to fit an instance of a 4 bumps model

Issue during transformation of data for HMP with 2+ participants

Error: running the hmp.utils.transform_data method for more than one subject requires apply_standard=True. This argument, however, leads to an error when combined with eeg_data.data.

Solution: in this case, for now, run the method on eeg_data (i.e., the whole xarray dataset) instead of just eeg_data.data (the data variable).

Threshold argument in utils.bootstrapping

In utils.bootstrapping, the threshold argument passed through the function call is ignored, since on line 807 it is forcibly set to 1 regardless, as shown in the image below (the same applies to cpus):
image
