pennmem / cmlreaders
CML data reading made easier...
Home Page: https://pennmem.github.io/cmlreaders/html/index.html
I think it's about time for a new release (mainly for a new conda package for easier feedback from users). Things to do first:
Since we've added quite a bit of functionality, should we call this version 0.4?
For incremented montage numbers, the subject code appears as (e.g.) R1006P_1 in pairs.json. This results in a key error when reading this data.
Generally, we want to use CMLReader to interact with specific readers indirectly. But there are some use cases where other readers are useful, such as the one for reading Ramulator event log files (this is helpful for me when analyzing output from tests). At present, it's a bit awkward to use:
reader = RamulatorEventLogReader("experiment_log", "fake subject", "fake experiment", 0, file_path="event_log.json")
events = reader.as_dataframe()
All that I really need to pass in this case is the path to the file, but the constructor requires subject, experiment, and session. I propose we add a new class method which works like this:
events = RamulatorEventLogReader.fromfile("event_log.json")
This has the advantage of not making the constructor overly complicated (e.g., arguments are optional if file_path is specified, otherwise they are required).
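A minimal sketch of the alternate-constructor pattern proposed here. EventLogReaderSketch is a hypothetical stand-in, not the real RamulatorEventLogReader, and the default data_type value is an assumption taken from the example call above:

```python
import json


class EventLogReaderSketch:
    """Hypothetical stand-in for a reader; not the cmlreaders implementation."""

    def __init__(self, data_type, subject, experiment, session, file_path=None):
        self.data_type = data_type
        self.subject = subject
        self.experiment = experiment
        self.session = session
        self.file_path = file_path

    @classmethod
    def fromfile(cls, file_path, data_type="experiment_log"):
        """Build a reader from a path alone; subject metadata is left unset."""
        return cls(data_type, subject=None, experiment=None, session=None,
                   file_path=file_path)

    def load(self):
        """Parse the JSON file this reader points to."""
        with open(self.file_path) as infile:
            return json.load(infile)
```

The regular constructor keeps its required arguments, and only the class method knows that they can be omitted when a path is given.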
It's annoying to have to remember to always specify the rootdir keyword argument. We should instead allow a RHINO_ROOT environment variable to be defined, which is used by default unless rootdir is specified to override it (/ is still the default if neither is set).
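The lookup order described here could be sketched as follows. resolve_rootdir is a hypothetical helper name, and the environ parameter exists only to make the function easy to test:

```python
import os


def resolve_rootdir(rootdir=None, environ=None):
    """Pick the root directory: explicit argument, then RHINO_ROOT, then "/"."""
    environ = os.environ if environ is None else environ
    if rootdir is not None:
        return rootdir
    return environ.get("RHINO_ROOT", "/")
```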
The "load_eeg" method of CMLReader should handle multi-session EEG loading better: If a user has passed events from multiple sessions, the reader object should not look for (or require) the "session" argument when it was instantiated. All the information it needs to load the EEG is in the events variable itself (which I think is kind of the whole point).
Right now, you get an error if you try to run the following code:
import numpy as np
import pandas as pd
from cmlreaders import CMLReader, get_data_index
df = get_data_index("r1")
s = 'R1111M'
exp = 'FR1'
sessions = df[np.logical_and(df["subject"] == s, df['experiment'] == exp)]['session'].unique()
# Load events from all sessions
events = pd.concat([
    CMLReader(s, exp, session).load("events")
    for session in sessions
])
# Just get word events
word_events = events[events.type == 'WORD']
# Get EEG
reader = CMLReader(s, exp)
pairs = CMLReader(s, exp).load("pairs")
eeg = reader.load_eeg(events=word_events, rel_start=-100, rel_stop=1700, scheme=pairs)
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-62-37e9cfb3b0f4> in <module>()
18 reader = CMLReader(s, exp)
19 pairs = CMLReader(s, exp).load("pairs")
---> 20 eeg = reader.load_eeg(events=word_events, rel_start=-100, rel_stop=1700, scheme=pairs)
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/cmlreader.py in load_eeg(self, events, rel_start, rel_stop, epochs, scheme)
268 })
269
--> 270 return self.load('eeg', **kwargs)
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/cmlreader.py in load(self, data_type, file_path, **kwargs)
198 montage=self.montage,
199 file_path=file_path,
--> 200 rootdir=self.rootdir).load(**kwargs)
201
202 def load_eeg(self, events: Optional[pd.DataFrame] = None,
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/readers/eeg.py in load(self, **kwargs)
286 rootdir=self.rootdir)
287
--> 288 path = Path(finder.find('sources'))
289 with path.open() as metafile:
290 self.sources_info = json.load(metafile,
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/path_finder.py in find(self, data_type)
120 raise InvalidDataTypeRequest("Unknown data type")
121
--> 122 expected_path = self._lookup_file(data_type)
123
124 return expected_path
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/path_finder.py in _lookup_file(self, data_type)
160 session=self.session,
161 localization=self.localization,
--> 162 montage=self.montage)
163 return expected_path
164
~/anaconda3/envs/CML/lib/python3.6/site-packages/cmlreaders-0.7.1-py3.6.egg/cmlreaders/path_finder.py in _find_single_path(self, paths, **kwargs)
210 if len(found_files) == 0:
211 raise FileNotFoundError("Unable to find the requested file in any "
--> 212 "of the expected locations:\n {}".format('\n'.join(checked_paths)))
213
214 if len(found_files) > 1:
FileNotFoundError: Unable to find the requested file in any of the expected locations:
/protocols/r1/subjects/R1111M/experiments/FR1/sessions/None/ephys/current_processed/sources.json
This fails because load_eeg sees that the reader's session is None, even though it doesn't need the user to supply that information if it already has a dataframe of events.
If the session numbers in events exactly match a session number supplied when the object was instantiated, the reader works. But this almost defeats the purpose of being able to give the reader arbitrary lists of events.
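One way the reader could recover session numbers from the events themselves is to group the events by their session field and load each group separately. The sketch below uses plain dicts as a stand-in for DataFrame rows so that it runs without pandas; with a real DataFrame, events.groupby("session") would play the same role. The function name is hypothetical:

```python
from collections import defaultdict


def group_events_by_session(events):
    """Group event records by their "session" field, ordered by session."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["session"]].append(event)
    return dict(sorted(grouped.items()))
```

Each group could then be passed to a per-session EEG reader and the results concatenated.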
I should be able to use the path finder to get the path to r1.json.
ramutils is required to run some tests, but the ramutils conda package is broken if we want to use different versions of other dependencies in cmlreaders (see pennmem/ram_utils#219).
When rereferencing/filtering, we currently load in all data first before dropping what we don't need. Instead, we should infer which data we need and only load that.
Right now, the EEG reader allows you to filter based on events, but another common use case is to only load eeg corresponding to particular channels. @mivade's proposed API was:
# get electrode contact info as a DataFrame
# this will have contact labels, locations, regions, coordinates, etc.
contacts = reader.load('contacts')
# or specify only contacts that are located in the MTL
# require_monopolar will raise an exception if monopolar is not possible
subset_eeg = reader.load('eeg',
                         contacts=contacts[contacts.region == 'MTL'],
                         require_monopolar=True)
As part of pre-release testing, cmlreaders should be used to conduct some high level data quality checks:
Jupyter, CMLReader, pandas, and MNE have dependency conflicts that cause Jupyter kernel errors unless a user is paying careful attention. Some of these might not be obvious to a developer who is testing functionality in plain Python or IPython.
Usually, it seems Jupyter issues can be resolved by simply re-installing Jupyter ('conda install jupyter') after they've installed other packages. But this isn't ideal, especially if we're interested in getting first-time users up-and-running with a simple conda install. The current setup virtually guarantees that new lab members will attempt to get a CML/PTSA/Jupyter Anaconda environment working, fail, and come to others looking for help.
I've found that for Python 3 environments, at the least:
I suspect these issues will change or get swapped for other problems as any of the above packages are updated.
Loading pairs or contacts can sometimes take close to a second. Since this data is not going to change if loading from the same subject, we should figure out a way to cache the results.
Alternatively, we could think of ways to improve the parsing of the horribly nested JSON structure.
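As a rough illustration of the caching idea, the standard library's functools.lru_cache can memoize a loader keyed on the file path. load_json_cached is a hypothetical helper, not part of cmlreaders:

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_json_cached(path):
    """Parse a JSON file once per path; later calls return the cached object."""
    with open(path) as infile:
        return json.load(infile)
```

Note that every caller receives the same object, so a caller that wants to modify the result should copy it first. The cache also never notices if the file changes on disk, which matches the assumption above that this data does not change for a given subject.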
It's great that users now have an easy way to get electrode_categories information from a cml reader object.
As this is critical information for many analysis pipelines, it would be ideal if electrode_categories information is integrated into returned contacts/pairs.json information. That way, a user could easily filter electrodes by SOZ/ictal/lesion before passing to the EEG reader. Currently, users must code a routine wherein electrode labels in electrode_categories dicts are matched to labels in a pairs/contacts dataframe. People are definitely going to make mistakes.
One implementation could be as follows:
If a user passes 'contacts' or 'pairs' (and maybe 'localization') into a reader object:
I understand that this isn't super elegant, but so long as we're storing all of our critical data in these ridiculous ways, I think this is the only option.
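A sketch of what the label matching might look like if it were done once inside the library instead of by every user. It assumes the categories arrive as a dict mapping a category name to a list of contact labels and that each contact row has a "label" field; both the function name and the category names in the usage note are hypothetical:

```python
def annotate_with_categories(rows, categories):
    """Return copies of the rows with one boolean field per category."""
    lookup = {name: set(labels) for name, labels in categories.items()}
    annotated = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for name, labels in lookup.items():
            new_row[name] = row["label"] in labels
        annotated.append(new_row)
    return annotated
```

For example, annotate_with_categories([{"label": "LAD1"}], {"soz": ["LAD1"]}) marks LAD1 with soz set to True. Pairs would need extra handling, since a pair involves two contact labels.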
TravisCI doesn't support Python 3.7 yet: travis-ci/travis-ci#9069
Since we're using conda for CI testing, the script should be updated to not use the Python version in the first place, but rather an environment variable to choose which version to install via conda.
PathFinder has weird behavior because a lot of things default to None. For example:
>>> finder = PathFinder('R1111M')
>>> finder.find('task_events')
FileNotFoundError Traceback (most recent call last)
<ipython-input-5-8a564b501890> in <module>()
----> 1 finder.find('task_events')
~/src/cmlreaders/cmlreaders/path_finder.py in find(self, data_type)
111 raise InvalidDataTypeRequest("Unknown data type")
112
--> 113 expected_path = self._lookup_file(data_type)
114
115 return expected_path
~/src/cmlreaders/cmlreaders/path_finder.py in _lookup_file(self, data_type)
150 session=self.session,
151 localization=self.localization,
--> 152 montage=self.montage)
153 return expected_path
154
~/src/cmlreaders/cmlreaders/path_finder.py in _find_single_path(self, paths, **kwargs)
200 if len(found_files) == 0:
201 raise FileNotFoundError("Unable to find the requested file in any "
--> 202 "of the expected locations:\n {}".format('\n'.join(checked_paths)))
203
204 if len(found_files) > 1:
FileNotFoundError: Unable to find the requested file in any of the expected locations:
/protocols/r1/subjects/R1111M/experiments/None/sessions/None/behavioral/current_processed/task_events.json
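One possible way to avoid searching for paths that contain a literal None is to check which template fields are unset before formatting. This is only a sketch: format_path is a hypothetical helper, and the template in the usage note is inferred from the path printed in the traceback above:

```python
import string


def format_path(template, **fields):
    """Fill in a path template, refusing to substitute unset (None) fields."""
    required = [name for _, name, _, _ in string.Formatter().parse(template)
                if name]
    missing = [name for name in required if fields.get(name) is None]
    if missing:
        raise ValueError(
            "cannot build path; missing values for: " + ", ".join(missing))
    return template.format(**fields)
```

With a template such as "protocols/r1/subjects/{subject}/experiments/{experiment}/sessions/{session}/behavioral/current_processed/task_events.json", calling format_path(template, subject="R1111M") would report that experiment and session are missing rather than looking for a file under experiments/None.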
The BaseCMLReader immediately tries to find files in __init__:
cmlreaders/cmlreaders/base_reader.py
Lines 60 to 76 in 9423bd9
This leads to some awkward exception handling logic if you want to optionally load something because you have to put the try...except around the creation of a reader object. It would be far more natural to do this around reader.load.
An example of what I mean follows. What you have to do now is:
try:
    category_reader = ElectrodeCategoriesReader(
        data_type="electrode_categories",
        subject=self.subject,
        experiment=self.experiment,
        session=self.session,
        localization=self.localization,
        montage=self.montage,
        rootdir=self.rootdir,
    )
except FileNotFoundError:
    print("oops")
else:
    categories = category_reader.load()
Ideally, we would instead do:
category_reader = ElectrodeCategoriesReader(
    data_type="electrode_categories",
    subject=self.subject,
    experiment=self.experiment,
    session=self.session,
    localization=self.localization,
    montage=self.montage,
    rootdir=self.rootdir,
)
try:
    categories = category_reader.load()
except FileNotFoundError:
    print("oops")
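The underlying change is to defer the file lookup from __init__ to load. A minimal sketch of that idea, using a hypothetical LazyReaderSketch class that takes a path directly rather than going through a path finder:

```python
from pathlib import Path


class LazyReaderSketch:
    """Hypothetical reader that does not touch the filesystem until load()."""

    def __init__(self, file_path):
        # Only remember the path; do not check that it exists yet.
        self._file_path = Path(file_path)

    def load(self):
        if not self._file_path.exists():
            raise FileNotFoundError(
                "Unable to find {}".format(self._file_path))
        return self._file_path.read_text()
```

Constructing the reader can then never raise FileNotFoundError, so the try...except only has to wrap the call to load.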
Example with R1264P:
IndexError Traceback (most recent call last)
<ipython-input-12-724e0abfb97a> in <module>()
----> 1 get_resting_connectivity(subject, rootdir)
<ipython-input-11-2a66889ef259> in get_resting_connectivity(subject, rootdir)
11 events = connectivity.get_countdown_events(reader)
12 resting = connectivity.countdown_to_resting(events, rate)
---> 13 eeg = connectivity.read_eeg_data(reader, resting, reref=True)
14 eeg_data.append(eeg)
15
~/src/thetamod/thetamod/connectivity.py in read_eeg_data(reader, events, reref)
112
113 eeg = reader.load_eeg(events=events, rel_start=0, rel_stop=1000,
--> 114 scheme=scheme)
115
116 return eeg
~/src/cmlreaders/cmlreaders/cmlreader.py in load_eeg(self, events, rel_start, rel_stop, epochs, contacts, scheme)
162 })
163
--> 164 return self.load('eeg', **kwargs)
~/src/cmlreaders/cmlreaders/cmlreader.py in load(self, data_type, file_path, **kwargs)
92 montage=self.montage,
93 file_path=file_path,
---> 94 rootdir=self.rootdir).load(**kwargs)
95
96 def load_eeg(self, events: Optional[pd.DataFrame] = None,
~/src/cmlreaders/cmlreaders/readers/eeg.py in load(self, **kwargs)
272 kwargs['epochs'] = epochs
273
--> 274 return self.as_timeseries(**kwargs)
275
276 def as_dataframe(self):
~/src/cmlreaders/cmlreaders/readers/eeg.py in as_timeseries(self, epochs, contacts, scheme)
346 if not reader.rereferencing_possible:
347 raise RereferencingNotPossibleError
--> 348 data = self.rereference(data, scheme)
349
350 # TODO: channels, tstart
~/src/cmlreaders/cmlreaders/readers/eeg.py in rereference(self, data, scheme)
377 c1, c2 = scheme.contact_1 - 1, scheme.contact_2 - 1
378 reref = np.array(
--> 379 [data[i, c1, :] - data[i, c2, :] for i in range(data.shape[0])]
380 )
381 return reref
~/src/cmlreaders/cmlreaders/readers/eeg.py in <listcomp>(.0)
377 c1, c2 = scheme.contact_1 - 1, scheme.contact_2 - 1
378 reref = np.array(
--> 379 [data[i, c1, :] - data[i, c2, :] for i in range(data.shape[0])]
380 )
381 return reref
IndexError: index 78 is out of bounds for axis 1 with size 78
I've only checked a few cases, but it looks like this is happening with pre-System 3 subjects. I have not encountered this problem with any monopolar Ramulator subjects.
Sorting directories named similar to 20180301_180306 should rely on the directory name rather than the mtime since it's conceivable that the mtime could wildly mismatch the name.
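A sketch of name-based sorting, assuming the directory names follow the YYYYMMDD_HHMMSS pattern suggested by the example above. The helper name is hypothetical:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def sort_timestamped(names):
    """Sort directory names by the timestamp encoded in the name itself."""
    return sorted(names,
                  key=lambda name: datetime.strptime(name, TIMESTAMP_FORMAT))
```

Because the fields are zero-padded, a plain string sort would give the same order; parsing has the side benefit of raising ValueError for names that do not match the expected pattern.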
As things stand, we have to import any new readers in cmlreaders/readers/__init__.py. This should be reworked to either automatically discover readers to import or restructure things to make this unnecessary.
There was a bug that prevented using load_eeg without any kwargs (i.e., to load an entire session). This has been fixed in an upcoming PR, but we should add a good test for this. Do we have any known very short sessions to make this test quick?
ramutils does a lot of version pinning which can tend to cause a lot of issues. We should just run those tests locally or on rhino or something and not run them on TravisCI.
Unity-based experiments produce a session.json file instead of experiment.log and session.log. This file should be supported by CMLReaders. It is formatted as one valid json string per line, so the parsing should be straightforward.
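Since the file is described as one JSON document per line, parsing could be sketched as below. parse_session_log is a hypothetical name, and skipping blank lines is an assumption about how tolerant the parser should be:

```python
import json


def parse_session_log(lines):
    """Parse an iterable of lines, each holding one JSON document."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            records.append(json.loads(line))
    return records
```

An open file object can be passed directly, since iterating over it yields lines.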
Example:
reader = CMLReader("R1111M", "PS2", 0)
events = reader.load("events")
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-37-2a71e3ba88c9> in <module>()
----> 1 events = reader.load("events")
2 pairs = reader.load("pairs")
~/src/cmlreaders/cmlreaders/cmlreader.py in load(self, data_type, file_path, **kwargs)
194 montage=self.montage,
195 file_path=file_path,
--> 196 rootdir=self.rootdir).load(**kwargs)
197
198 def load_eeg(self, events: Optional[pd.DataFrame] = None,
~/src/cmlreaders/cmlreaders/base_reader.py in __init__(self, data_type, subject, experiment, session, localization, montage, file_path, rootdir)
79 session=session, localization=localization,
80 montage=montage, rootdir=rootdir)
---> 81 self._file_path = finder.find(data_type)
82
83 self.subject = subject
~/src/cmlreaders/cmlreaders/path_finder.py in find(self, data_type)
120 raise InvalidDataTypeRequest("Unknown data type")
121
--> 122 expected_path = self._lookup_file(data_type)
123
124 return expected_path
~/src/cmlreaders/cmlreaders/path_finder.py in _lookup_file(self, data_type)
160 session=self.session,
161 localization=self.localization,
--> 162 montage=self.montage)
163 return expected_path
164
~/src/cmlreaders/cmlreaders/path_finder.py in _find_single_path(self, paths, **kwargs)
210 if len(found_files) == 0:
211 raise FileNotFoundError("Unable to find the requested file in any "
--> 212 "of the expected locations:\n {}".format('\n'.join(checked_paths)))
213
214 if len(found_files) > 1:
FileNotFoundError: Unable to find the requested file in any of the expected locations:
/Users/depalati/mnt/rhino/protocols/r1/subjects/R1111M/experiments/PS2/sessions/0/behavioral/current_processed/all_events.json
Temporary workaround: request task_events instead of events.
cmlreaders incorrectly uses the localization number to look up special subject identifiers instead of the montage number. For example, R1006P had a montage change that was not the result of a re-implant. Therefore, the montage number was incremented, but the localization number was not. Attempting to request pairs information for this subject results in an error:
reader = CMLReader(subject="R1006P", localization=0, montage=1)
pairs_df = reader.load("pairs")  # raises a KeyError
Instead, the montage number should always be used when looking up data in /data10/...
When loading data from the /protocols "database", both the localization and montage numbers are needed.
R1384J, experiment FR1, session 1 is split into 2 HDF5 files due to a pause in the session; CMLReader is unable to load the data for this session.
reader = CMLReader(subject='R1384J',experiment='FR1',session=1)
events = reader.load('events')
eeg = reader.load_eeg(events,0,100)
Traceback (most recent call last):
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-cbd19709a38f>", line 1, in <module>
eeg = reader.load_eeg(events,0,100)
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/cmlreaders/cmlreader.py", line 164, in load_eeg
return self.load('eeg', **kwargs)
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/cmlreaders/cmlreader.py", line 94, in load
rootdir=self.rootdir).load(**kwargs)
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/cmlreaders/readers/eeg.py", line 274, in load
return self.as_timeseries(**kwargs)
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/cmlreaders/readers/eeg.py", line 343, in as_timeseries
data = reader.read()
File "/Users/leond/anaconda2/envs/thetamod/lib/python3.6/site-packages/cmlreaders/readers/eeg.py", line 210, in read
data = np.array([ts[epoch[0]:epoch[1], :].T for epoch in self.epochs])
ValueError: could not broadcast input array from shape (178,100) into shape (178)
For some subjects that have split EEGs, cmlreaders does not correctly handle non-sequential or missing channels. For example:
reader = CMLReader(subject="R1006P", experiment="FR2", session=1)
eeg = reader.load_eeg()
The returned timeseries includes CH0, CH1, CH99, and CH100. None of these channels exist in the noreref directory for this subject/experiment/session combination. It appears that cmlreaders is assuming that all channels exist consecutively. It is not clear why these channels are not in the noreref directory, but the channels that are present are consistent with the channels listed in the jacksheet. At least for this subject, the split channels are also consistent with what is in the pairs.json file for the montage that was used in this session.
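If the cause is indeed an assumption that channel numbers are consecutive, one remedy is to map each recorded channel number to its row in the data explicitly instead of using the number (minus one) as an index. The sketch below uses nested lists in place of arrays so that it runs without NumPy; the function name and argument layout are hypothetical:

```python
def rereference(data, channel_numbers, pairs):
    """Bipolar rereference using an explicit channel-number-to-row lookup.

    data: one row of samples per recorded channel
    channel_numbers: the channel number of each row, in row order
    pairs: (contact_1, contact_2) channel-number tuples
    """
    index_of = {number: row for row, number in enumerate(channel_numbers)}
    requested = {contact for pair in pairs for contact in pair}
    missing = sorted(requested - set(index_of))
    if missing:
        raise KeyError("channels not present in the data: {}".format(missing))
    return [
        [a - b for a, b in zip(data[index_of[c1]], data[index_of[c2]])]
        for c1, c2 in pairs
    ]
```

With this lookup, a recording that contains channels 2, 3, and 100 works without requiring 100 rows, and a pair that refers to an absent channel fails with a message naming the channel.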
Python 3.7 is now available. As soon as it is available with conda, testing on 3.7 should be enabled on TravisCI.
The electrode category file is highly irregular, but we do need to be able to consistently read it in a reasonable form
Test data output is currently written to cmlreaders/test/data/output. This should be going to a temporary directory instead.
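A sketch of writing test output to a temporary directory with the standard library. The helper is hypothetical; if the test suite uses pytest, its built-in tmp_path fixture would serve the same purpose:

```python
import os
import tempfile


def write_and_read_back(filename, text):
    """Write text to a file in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        with open(path, "w") as outfile:
            outfile.write(text)
        with open(path) as infile:
            contents = infile.read()
    # The directory and everything in it are removed on leaving the block.
    return path, contents
```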
A common use case when working with events is to load multiple sessions at a time. Right now, this is not as easy as it should be.
import pandas as pd
import cmlreaders as cml
# sessions_completed and rhino_root are assumed to be defined elsewhere
all_events = []
for session in sessions_completed:
    sess_events = cml.CMLReader(subject="R1409D", experiment="FR6", session=session,
                                localization=0, montage=0, rootdir=rhino_root).load('task_events')
    all_events.append(sess_events)
all_sessions_df = pd.concat(all_events)
The proposed API is to have a special method associated with cml_reader to allow loading multiple sessions of events. The other option is to allow additional kwargs in .load(), but then it becomes extremely difficult to document that function since it would take different parameters depending on the data type being loaded. Instead, we want to mimic the behavior of load_eeg and have it be a separate method associated with the class. Single sessions of events can be loaded using either reader.load() after having specified a session when creating the reader, or by using reader.load_events(sessions=[1]). At a minimum, the following cases should be handled:
reader = CMLReader(subject="R1409D", experiment="FR6")
# Load all sessions
all_fr6_events = reader.load_events()
# Load specific sessions
subset_fr6_events = reader.load_events(sessions=[0, 1])
reader = CMLReader(subject="R1409D")
# Invalid Request
all_events = reader.load_events()
# Load sessions across experiments
all_record_only_events = reader.load_events(experiments=['catFR1', 'FR1'])
Depending on whether the use case is important enough, it could also handle the following cases:
reader = CMLReader(subject="R1409D")
# Multi-experiment, single session
multi_exp_single_sess = reader.load_events(experiments=['catFR1', 'FR1'], sessions=[0])
# Multi-experiment, multi-session
multi_exp_multi_sess = reader.load_events(experiments=['catFR1', 'FR1'], sessions=[0, 1])
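The cases above boil down to expanding the request into (experiment, session) combinations and rejecting a request that names no experiment. A sketch of that expansion, where the available mapping is a hypothetical stand-in for what the data index would provide:

```python
def resolve_event_requests(available, experiment=None, experiments=None,
                           sessions=None):
    """Expand a load_events-style request into (experiment, session) tuples.

    available: dict mapping experiment name -> list of session numbers
    experiment: the experiment the reader was created with, if any
    experiments, sessions: optional filters passed by the caller
    """
    if experiments is None:
        if experiment is None:
            raise ValueError("no experiment specified for the reader or the call")
        experiments = [experiment]
    requests = []
    for exp in experiments:
        for session in available.get(exp, []):
            if sessions is None or session in sessions:
                requests.append((exp, session))
    return requests
```

Each resulting tuple could then be loaded with a single-session reader and the events concatenated.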
Instead of
    self.rootdir = rootdir
PathFinder should have
    self.rootdir = os.path.expanduser(rootdir)
We can use the subject ID in PathFinder to determine which directory in /protocols to use.
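A sketch of prefix-based protocol selection. The only mapping taken from the examples on this page is that R1 subjects live under /protocols/r1; the function name is hypothetical, and any further prefixes would have to be supplied by the caller:

```python
def protocol_for_subject(subject, prefixes=None):
    """Return the /protocols subdirectory name for a subject ID."""
    prefixes = {"R1": "r1"} if prefixes is None else prefixes
    for prefix, protocol in prefixes.items():
        if subject.startswith(prefix):
            return protocol
    raise ValueError("no known protocol for subject {!r}".format(subject))
```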
When loading a jacksheet, I get a dataframe back that only has one column: "channel_label". This combines both the jackbox number and the contact label. We should instead have two columns named "number" and "label".
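A sketch of the split, assuming the combined value looks like a number followed by whitespace and then the label (for example "1 LAD1"). That layout is an assumption about the jacksheet format, and the helper name is hypothetical:

```python
def split_channel_label(channel_label):
    """Split a combined "number label" string into (number, label)."""
    number, label = channel_label.split(maxsplit=1)
    return int(number), label
```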
Example: trying to filter events with a resulting DataFrame with no rows results in an IndexError when concatenating TimeSeries objects because none get added in the as_timeseries method. We should explicitly check for this case and raise a more helpful error message.
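The explicit check could be as small as the following sketch. The function name and the choice of ValueError are assumptions; len works for both lists and DataFrames:

```python
def require_events(events):
    """Raise a descriptive error if there are no events to load EEG for."""
    if len(events) == 0:
        raise ValueError(
            "No events to load: the events passed in are empty "
            "(did a filter remove every row?)")
    return events
```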
Trying to build a package for pennmem/artdet, I get import errors because cmlreaders requires pandas, but pandas is not listed as a requirement in conda.recipe/meta.yaml.
pkgutil defies all logic in how it works, and apparently as written, the magic importing only works if you are in the same directory as the cmlreaders package...
The TimeSeries class serves as a simple container for EEG data which can then be exported to other formats for actual analysis. This leads to some confusion (especially since PTSA has a class with the same name). Instead, it should be named EEGData or something similar.
Rather than defaulting to 0, CMLReader should read the data index (this can be cached to avoid re-reading several times) and determine localization and montage number from there when session is specified. In cases where either is nan, 0 can be assumed.
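A sketch of the lookup with the nan fallback. The data index is represented here as a list of dicts so that the example runs without pandas; the field names follow the columns used elsewhere on this page (subject, experiment, session) plus localization and montage, and the function names are hypothetical:

```python
import math


def _int_or_zero(value):
    """Convert an index value to int, treating None and nan as 0."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 0
    return int(value)


def lookup_localization_montage(index_rows, subject, experiment, session):
    """Find (localization, montage) for one session in the data index."""
    for row in index_rows:
        key = (row["subject"], row["experiment"], row["session"])
        if key == (subject, experiment, session):
            return (_int_or_zero(row.get("localization")),
                    _int_or_zero(row.get("montage")))
    raise KeyError("session not found in the data index")
```

Converting with int also handles index values that are stored as strings such as "1".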
Example: MontageReader doesn't know if it's trying to read contacts or pairs.
Pandas has a notion of accessors to add additional namespaced functionality to DataFrame and other pandas objects. We could use this by adding things like event accessors which can do some common queries, for example something like:
@pd.api.extensions.register_dataframe_accessor("events")
class EventsAccessor(object):
    ...
    @property
    def stim_events(self):
        """Filter events and return only stim events."""
        return self._obj[self._obj["type"] == "STIM_ON"]
Rhino's filesystem can be quite slow, so it is worth considering enabling caching when loading some types of data. This could most easily be accomplished with joblib.
When reading the data index, I see things like '0' instead of 0. This appears to be because they are listed as strings in r1.json.
EEGReader's docstring shows some examples of how to use it, but it's missing loading data via events (which is the most common/useful way of loading EEG data). We should update this and also include some examples in the CMLReader.load_eeg method.
Alternatively, move the examples to CMLReader.load_eeg and in EEGReader add a "see also" note to refer to load_eeg.
EEG files have names similar to R1387E_FR1_0_25Jan18_1826.h5 for Ramulator HDF5 files or R1337E_FR1_0_08Sep17_2045.030 for "split" EEG files. When using the PathFinder, how can we account for these timestamped filenames?
One idea is to only point to the directory and let EEGReader handle it from there. Other ideas, @zduey?
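If the finder only points at the directory, the timestamped part of the name could be matched with a glob pattern. This is a sketch under the assumption that file names always start with subject, experiment, and session separated by underscores, as in the two examples above; find_eeg_files is a hypothetical helper:

```python
from pathlib import Path


def find_eeg_files(directory, subject, experiment, session):
    """List files named <subject>_<experiment>_<session>_<timestamp...>."""
    pattern = "{}_{}_{}_*".format(subject, experiment, session)
    return sorted(Path(directory).glob(pattern))
```

The trailing underscore before the wildcard keeps session 1 from also matching session 10.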
@LoganJF says it needs to be installed for him.
E.g., line 155 prints timestamped directories whenever finding Ramulator files. Presumably this was meant for debugging and should be replaced with a call to logger.debug or removed.
As is, it's a bit of a misnomer because PathFinder can return file paths and directory paths and is a mouthful (handful?) to type. Instead, I suggest PathFinder.find.
Current signature:
def __init__(self, subject, rootdir='/', experiment=None, session=None,
             localization=None, montage=None):
It would be more natural to be able to type
finder = PathFinder(subject, experiment, session)
which is a more natural ordering than having rootdir for some reason appear after subject. I would make it the last keyword argument.
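The reordered signature might look like the sketch below. PathFinderSketch is a hypothetical illustration of the argument order only and does none of the real path lookup:

```python
class PathFinderSketch:
    """Hypothetical illustration of the proposed argument order."""

    def __init__(self, subject, experiment=None, session=None,
                 localization=None, montage=None, rootdir="/"):
        self.subject = subject
        self.experiment = experiment
        self.session = session
        self.localization = localization
        self.montage = montage
        self.rootdir = rootdir
```

With this ordering, PathFinderSketch("R1111M", "FR1", 0) works positionally. Existing callers that pass rootdir as the second positional argument would need to switch to the keyword form.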