kmdouglass / bstore

Lightweight data management and analysis tools for single-molecule fluorescence microscopy.

License: Other
A good first place for this is the FAQ.
The `query()` method will only return locResults when called, probably because this was all that was needed for batch processing. This functionality should be extended to all datasetTypes.

The best way to introduce this change will likely be to have the `query()` method return all dataset types by default, but accept optional arguments that restrict the types it returns when needed, such as in batch processing, which only needs localizations.

Once this is implemented, Tutorial 1 should be updated to highlight the utility of the `query()` method.
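As a minimal sketch (the function and argument names are illustrative assumptions, not B-Store's actual API), the default-plus-filter behavior could look like this:

```python
# Hypothetical sketch of a query() that returns every dataset type by
# default but accepts an optional restriction, e.g. for batch processing.
# Names are illustrative and not B-Store's actual API.
def query(datasets, datasetTypes=None):
    """Return datasets, optionally restricted to certain datasetTypes.

    datasets     : list of dicts, each with at least a 'datasetType' key
    datasetTypes : iterable of str, or None; None means return everything
    """
    if datasetTypes is None:
        return list(datasets)
    wanted = set(datasetTypes)
    return [d for d in datasets if d['datasetType'] in wanted]
```

Batch processing could then call `query(db, datasetTypes=['locResults'])` and keep its current behavior.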
Currently, `convert()` of the ConvertHeader class will also write out the column (with an empty header) containing the indexes of the underlying Pandas DataFrame. This should be removed or made optional, since the LEB specification does not list the index as a primary column.
This will help if a FiducialDriftCorrection processor is reused on multiple DataFrames.
The PEP 8 standard should be adopted to ensure that B-Store is easily extended.
Currently, there is no checking for empty dataset types (locResults, etc.) when the database is built. I need to handle the case when these might be empty, for example when no widefield images are present.
The `OverlayClusters` multiprocessor was somewhat hastily written for a project. Time should be taken to revise it and make sure that it is well implemented for release.
DataSTORM was always intended to be a temporary name. Since the project has solidified into a database management and analysis platform for SMLM data, and the version 0.1.0 release should happen this year, a final name should be decided upon.
Here are a few candidates that I like and that don't clash with anything on Google:
Bernie: a search for "Bernie database" returns the Bernie Sanders database breach. Unfortunately, Bernard is a microscopy equipment company somewhere in the US. There's also a Python package supporting Bernie Sanders called BernieScript.

I wonder whether it would be useful to implement a sort of "generic" datasetType that can store misc. images and metadata associated with a dataset?
One could imagine attaching blot images for acquisition group, for example.
Would it be sufficient to implement this datasetType as a numeric array (of arbitrary dimensions) and JSON attributes?
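A minimal sketch of such a generic type, assuming it pairs a NumPy array of arbitrary shape with JSON-serializable attributes (the class and method names are hypothetical, not existing B-Store code):

```python
import json
import numpy as np

# Illustrative sketch of a "generic" datasetType: an arbitrary-dimension
# numeric array plus JSON-serializable attributes. The class and field
# names are assumptions, not B-Store's actual API.
class GenericDataset:
    def __init__(self, data, attributes=None):
        self.data = np.asarray(data)        # numeric array, any shape
        self.attributes = attributes or {}  # misc. metadata, e.g. blot notes

    def attrsAsJSON(self):
        # The JSON string could be written as an HDF5 attribute
        return json.dumps(self.attributes, sort_keys=True)
```

The array could then be stored as an HDF5 dataset with the JSON string attached as an attribute.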
Related to #10 , the Merger could be made column agnostic. To do this, the set of operations on non-critical columns could be defined at runtime, rather than hard coded into the Merger class.
When an unnamed column corresponding to a Pandas index is present, batch processing will throw an error related to an unknown column name '0'. Specifically, this error often appears in the Merger processor.
One fix could be introduced into the mapping between columns: if no mapping exists for a specific column, then its name could be left unchanged.
It would be highly beneficial to retain the particle ID column when merging localizations so that merged localizations can be traced back to their original, unmerged versions.
Not a code issue but a design issue: how should the information obtained from a parser be used to create Datasets/DatabaseAtoms? The data in an atom can be in one of many formats, so it might not make sense to allow the parser to load the data; perhaps it should only return the name information.
If I do this, then what should load the data? Wrappers around Pandas' read_csv, etc.?
In the usual workflow, the localizations are drift-corrected and then added to the database. Localizations where no beads are present are labeled with a `DCX` suffix in their filename and are not added to the database.
This creates a situation where widefield images and metadata exist that could be added to the database, but the localization metadata won't be added because there are no localization results, and the widefield images will be abandoned without any other data attached to them.
What is the best way to handle this situation? Should there be a check, similar to what exists now for metadata, that doesn't put metadata or widefield images if the locResults are missing? Should the non-drift-corrected localization results be added after the drift-corrected locResults, along with their metadata and widefield images?
The Travis badge displays an error if the Anaconda servers are too busy, which generates the following error during the Travis build:
Error: HTTPError: 500 Server Error: Internal Server Error for url: https://repo.continuum.io/pkgs/free/linux-64/conda-env-2.5.1-py35_0.tar.bz2: https://repo.continuum.io/pkgs/free/linux-64/conda-env-2.5.1-py35_0.tar.bz2
Is there a way to ignore these kinds of errors, since they are caused not by B-Store but by Anaconda? See build 15 for the full report.
PyTables really doesn't like spaces in names, and this includes the `prefix` field.
I changed `prefix` to a property that automatically converts spaces to underscores, but there should be a unit test for this so changes in the future are caught. It just needs to verify that spaces get converted to underscores.
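The test could be as small as the following sketch; the `Atom` class here is a stand-in for wherever the `prefix` property actually lives in B-Store:

```python
import unittest

# Stand-in for the class that owns the prefix property; the real property
# lives in B-Store's code. Only the space-to-underscore behavior matters.
class Atom:
    @property
    def prefix(self):
        return self._prefix

    @prefix.setter
    def prefix(self, value):
        # PyTables dislikes spaces in names, so convert them
        self._prefix = value.replace(' ', '_')

class TestPrefix(unittest.TestCase):
    def test_spaces_become_underscores(self):
        atom = Atom()
        atom.prefix = 'HeLa Control'
        self.assertEqual(atom.prefix, 'HeLa_Control')
```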
I originally decided against an append method for Databases, but now I think that such a method makes sense.
Imagine taking replicates on different days, but wanting the replicates to be in the same database. It makes sense then, especially with the new dateID (see #26 and #24), to append data to the database.
One way to do this would be to call the `Database.build()` method. Should a copy of the HDF file be made before appending?
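If a safety copy is desirable, a small helper could back up the file before `build()` is called to append. `backupBeforeAppend` is a hypothetical name, not existing B-Store code:

```python
import shutil
from pathlib import Path

# Sketch: back up the HDF file before appending, so a failed append
# cannot corrupt the only copy. The append itself would reuse
# Database.build(); this helper only handles the backup step.
def backupBeforeAppend(dbFile):
    src = Path(dbFile)
    backup = src.with_suffix(src.suffix + '.bak')
    if src.exists():
        shutil.copy2(str(src), str(backup))
    return backup
```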
I noticed that Jupyter won't recognize Conda environments in its latest version. This is fixed if the `nb_conda` package is installed via Conda: https://docs.continuum.io/anaconda/jupyter-notebook-extensions#notebook-conda
This should be installed by default, since I use the notebooks so regularly.
@christian-7 and I agree that no one will use the software if it's not something they can use within ~5 minutes of installing.
One way to address this is to add a built-in, simple parser that simply converts filenames with a number at the end to a `DatabaseAtom` for insertion into the database. Essentially, it will only populate the required fields of `DatabaseAtom`, which are `prefix`, `acqID`, and `datasetType`.
Another advantage of doing this is that it can also serve as the tutorial for writing one's own parser.
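A sketch of what such a parser's core could look like, assuming filenames end in an acquisition number (the function name, regex, and return format are illustrative assumptions):

```python
import re

# Sketch of the proposed simple parser: a filename ending in a number
# maps to the three required fields. Field names follow the issue text;
# the regex and return format are assumptions.
def parseSimpleFilename(filename, datasetType='locResults'):
    """Split a filename like 'HeLa_Control_7.csv' into its ID fields."""
    stem = filename.rsplit('.', 1)[0]         # drop the extension
    match = re.match(r'(.*?)_?(\d+)$', stem)  # trailing digits are acqID
    if not match:
        raise ValueError('No trailing acquisition number: ' + filename)
    prefix, acqID = match.group(1), int(match.group(2))
    return {'prefix': prefix, 'acqID': acqID, 'datasetType': datasetType}
```

For example, 'HeLa_Control_7.csv' would yield prefix 'HeLa_Control' and acqID 7.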
To acknowledge:
Building the database with the `build()` method will halt when a metadata file is encountered but there is no localization results file to go with it. This can happen if one dataset could not be drift corrected while the others were.
Example error message:
```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/douglass/src/DataSTORM/DataSTORM/database.py in _putLocMetadata(self, atom)
447 attrVal = json.dumps(atom.data[currKey])
--> 448 hdf[dataset].attrs[attrKey] = attrVal
449 except KeyError:
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2579)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2538)()
/home/douglass/anaconda3/envs/DataSTORM/lib/python3.5/site-packages/h5py/_hl/group.py in __getitem__(self, name)
163 else:
--> 164 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
165
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2579)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2538)()
h5py/h5o.pyx in h5py.h5o.open (/home/ilan/minonda/conda-bld/work/h5py/h5o.c:3551)()
KeyError: 'Unable to open object (Component not found)'

During handling of the above exception, another exception occurred:

LocResultsDoNotExist Traceback (most recent call last)
<ipython-input-4-cc56098e5c75> in <module>()
5
6 # Build the database
----> 7 db.build(parser, searchDirectory, locResultsString = 'locResults_DC.dat')
/home/douglass/src/DataSTORM/DataSTORM/database.py in build(self, parser, searchDirectory, dryRun, locResultsString, locMetadataString, widefieldImageString)
258 pp.pprint(parser.getBasicInfo())
259 if not dryRun:
--> 260 self.put(parser.getDatabaseAtom())
261
262 def _checkKeyExistence(self, atom):
/home/douglass/src/DataSTORM/DataSTORM/database.py in put(self, atom)
575 hdf.close()
576 elif atom.datasetType == 'locMetadata':
--> 577 self._putLocMetadata(atom)
578 elif atom.datasetType == 'widefieldImage':
579 # TODO: widefield images should also have SMLM ID's attached
/home/douglass/src/DataSTORM/DataSTORM/database.py in _putLocMetadata(self, atom)
449 except KeyError:
450 # Raised when the hdf5 key does not exist in the database.
--> 451 raise LocResultsDoNotExist(('Error: Cannot not append metadata. '
452 'No localization results exist with '
453 'these atomic IDs.'))
LocResultsDoNotExist: 'Error: Cannot not append metadata. No localization results exist with these atomic IDs.'
```
A simple solution would be to add a try...except block and handle these more gracefully, simply skipping the put() operation for this case.
I could also try to catch when this happens in the build dry run.
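The try...except approach can be sketched as follows, with stand-ins for B-Store's `LocResultsDoNotExist` exception and database object (the helper name `safePut` is an assumption):

```python
# Sketch of the proposed fix: wrap the put() call so that missing
# localization results skip the dataset instead of halting the build.
# LocResultsDoNotExist and the db/atom objects are stand-ins for
# B-Store's own classes.
class LocResultsDoNotExist(Exception):
    pass

def safePut(db, atom, skipped=None):
    """Try db.put(atom); record and skip atoms whose locResults are missing."""
    try:
        db.put(atom)
        return True
    except LocResultsDoNotExist:
        if skipped is not None:
            skipped.append(atom)  # report these to the user at the end
        return False
```

The build loop could collect the skipped atoms and print them in a summary, which would also help during a dry run.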
A notebook with a tutorial on merging should be created.
@christian-7 reported these errors when trying to build a database:
```python
# Import the libraries
from bstore import database, parsers
from pathlib import Path

# Specify the database file and create an HDF database
dbFile = 'database_test.h5'
db = database.HDFDatabase(dbFile)

# Define the parser that reads the files.
# Also specify the directory to search for raw files.
parser = parsers.MMParser()
searchDirectory = Path('Z:/Christian-Sieben/data_HTP/2016-06-03_humanCent_Sas6_A647')

# Build the database
db.build(parser, searchDirectory)
```

```
Unexpected error: <class 'ImportError'>
Unexpected error: <class 'UnboundLocalError'>
0 files were successfully parsed.
```
Obviously, it's not clear at all what caused the error. There are three try/except blocks inside database.py that could have caught these. The except blocks should be modified to better isolate the problem.
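One way to isolate the problem is to log the full traceback and the offending file instead of only the exception class. `tryParse` and `parseFilename` are hypothetical names used for illustration:

```python
import traceback

# Illustrative sketch: instead of a bare except that reports only
# <class 'ImportError'>, record the filename, the exception, and the
# full traceback so the user can see what actually failed.
def tryParse(parser, filename, failures):
    try:
        return parser.parseFilename(filename)
    except Exception as err:
        # Keep the full context rather than just the exception class
        failures.append((filename, err, traceback.format_exc()))
        return None
```

At the end of the build, the collected failures could be printed alongside the "0 files were successfully parsed" summary.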
Widefield images currently have no attributes containing the B-Store dataset IDs assigned to them inside the HDF database.
This is not a critical bug, since the IDs can be inferred from the HDF key, but they should be added nevertheless for consistency with locResults.
Perhaps this feature can be added when widefieldImage metadata is added.
I would like to use the HDF5 reader in FIJI to facilitate batch analyses on the widefield images.
To do this, the widefield images require an attribute called `element_size_um` that is a 3-element array of 32-bit floats specifying the pixel size. This should be written as an attribute of the widefield images. For example, if the pixel size were 108 nm, then `element_size_um = [1.0, 0.108, 0.108]`.
See http://lmb.informatik.uni-freiburg.de/resources/opensource/imagej_plugins/hdf5.html for more information.
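Building the attribute array is straightforward; writing it would then be a one-liner with h5py, e.g. `hdf[key].attrs['element_size_um'] = elementSize(0.108)`. The helper below is a sketch that assumes the first element is the z-step in micrometers (the function name is hypothetical):

```python
import numpy as np

# Sketch of the attribute FIJI's HDF5 reader expects: element_size_um is
# a 3-element float32 array (z, y, x) in micrometers. A 108 nm pixel
# gives [zStep, 0.108, 0.108].
def elementSize(pixelSizeUm, zStepUm=1.0):
    return np.array([zStepUm, pixelSizeUm, pixelSizeUm], dtype=np.float32)
```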
The README provides installation instructions for Linux and Windows. The Linux installation instructions need to be checked for correctness and the Windows instructions need to be written and checked.
The HDF database does a great job of packaging all the files from a single day. However, what happens when we have multiple HDF files from an entire project? These can become unruly to handle when they're spread across multiple directories, just like the datasets in a single experiment.
I can think of a few solutions to this problem. One solution is to create a class that stores and associates user notes to the HDF files so that their contents can be tracked and easily displayed. The downside to this is that the HDF files may move to different directories, so such a class could not be permanent.
Another solution is to create experimental attributes that are associated with each acquisition group. Then, a HDF file's attributes can be parsed to understand what type of information they contain.
How else can one manage a large number of related HDF files?
When the fiducial tracks become very large, DBSCAN starts to consume A LOT of memory (it consumed all 48 GB on my machine earlier). This is with a neighbor radius of 500 and a minimum number of samples of about 35,000. It's most prevalent for long, consistent tracks that are uninterrupted.
The development branch should be brought into CI.
This should include moving references in test_files to the new test_data repo.
@MStefko 's code may be found here: https://github.com/MStefko/anchor-for-STORM
After implementing generic types in #39, I realized that all dataset types can be decoupled from both the Database and Parser classes. This would make it much easier to extend B-Store to new dataset types. It also really doesn't make sense to treat locResults, locMetadata, and widefieldImages as types separate from generics.
Since two of the main purposes of B-Store are to be extensible and to be easily understood when the databases are examined by humans, I should probably refactor the code at this point to completely separate datasetTypes from Parser and Database.
pyhull should not be imported by default because it is not available for Windows.
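A common pattern for this is a guarded import at module load, so features that need the convex-hull code can check a flag before running (the flag and helper names are assumptions, not existing B-Store code):

```python
# Guarded import: pyhull is unavailable on Windows, so don't import it
# unconditionally at module load. Features that need it check HAS_PYHULL.
try:
    import pyhull  # noqa: F401
    HAS_PYHULL = True
except ImportError:
    HAS_PYHULL = False

def requirePyhull():
    """Raise a clear error when a pyhull-dependent feature is used."""
    if not HAS_PYHULL:
        raise ImportError('This feature requires pyhull, which is not '
                          'available on this platform.')
```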
To do: Add a license, probably BSD3.
The project could benefit from an automatic change log. For ideas, see http://keepachangelog.com/ and http://5by5.tv/changelog/127.
Rather than listing sequentially which datasets are added, the dry run could be more informative by listing all the datasetTypes belonging to an ID set.
It could also be made more useful by testing the placement of data without actually performing it, raising errors, for example, when metadata exists but no locResults do.
Some of the localization files we've generated could not fit in memory when using ThunderSTORM. Furthermore, they caused a significant slowdown on my computer when processing them in DataSTORM.
To overcome this limitation I can implement data chunking, whereby only part of the data is processed at a time.
The first step should therefore be to make the batch processor compatible with chunking, along with the Filter, ConvertHeader, and CleanUp processors.
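With Pandas this can be sketched using read_csv's chunksize argument; the processor is any callable mapping a DataFrame to a DataFrame (the helper name is an assumption):

```python
import pandas as pd

# Sketch of chunked processing: read the localization file in pieces and
# apply a processor to each piece, so the whole file never sits in
# memory at once. Works for row-wise processors like Filter or
# ConvertHeader; merging across chunk boundaries needs more care.
def processInChunks(csvPath, processor, chunkSize=100000):
    pieces = []
    for chunk in pd.read_csv(csvPath, chunksize=chunkSize):
        pieces.append(processor(chunk))
    return pd.concat(pieces, ignore_index=True)
```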
@christian-7 has pointed out multiple times that the drift correction module doesn't work if you select multiple beads but choose not to use spatial clustering (I think it's the `noClustering` argument). This happens because of the way the module sorts localizations in the user-selected ROI's. Rather than looking in each selected ROI independently, the module throws out localizations not in the selected ROI's and then attempts to spline fit whatever is left.
This means that if there is no spatial clustering performed on what's left, the module will attempt to fit a single spline to beads that may be located very far apart in the field of view. Turning on spatial clustering first clusters the localizations locally to find independent beads, and then fits a spline to the clusters.
One possible solution is to really separate ROI's within the module during the bead identification so that they are spline fit individually. This could also eliminate the need for spatial clustering, but it might be nice to keep it as an option because beads that are surrounded by dense areas of localizations may not work well for spline fitting.
This relates to #24.
The question is: where should the date ID go in the HDF key?
I am adding a `date` field to DatabaseAtoms/Datasets before the first release because I can imagine at least one important need for making this a primary dataset ID (albeit still optional, like channelID and posID).
This case would be when you have the same cell type measured on different days, but want to keep these measurements all in the same database with the same `prefix`. You might do this if you're performing replicate measurements.
In this case I think the prefix takes precedence, so I am leaning towards a key structure where the prefix comes first, then the optional date, then the prefix followed by the acquisition ID. For example:
Cos7/2016-06-16/Cos7_1/locResults_A647_Pos0
This seems strange to me, though, because my first instinct is to sort by date first:
2016-06-16/Cos7/Cos7_1/locResults_A647_Pos0
I imagine other people will also naturally sort first by date, so this is not a choice I can ignore.
Which of the two is best? Is there a third option? I don't think adding the date to the very end will work because there's no logical connection between datasets with the same `acqID`. If I add the date to the end:
Cos7/Cos7_1/locResults_A647_Pos0_2016-06-16
then it might imply that all `Cos7_1` datasets on different dates are somehow connected.
Remember that in the end, the hierarchy should make sense from both the computer's and the human's standpoint.
The README could use a short description of the problem it solves. Specifically, B-Store makes it easy to organize many different files from an SMLM dataset and then analyze them.
One possible idea for this section is a graphical description of a number of directories containing localization files, metadata, and other images spread across multiple directories. These are all sorted via an arrow into a single database structure that's clean and compact. From this structure, analyses on the data you want to look at are performed.
This could help better describe B-Store for users wanting to know exactly what it does.
OME-XML metadata may be read most easily with python-bioformats, but the Bioformats project is primarily GPL-licensed, which I would like to avoid.
tifffile, on the other hand, is BSD-licensed, actively developed, and appears to read TIFF metadata. It also installs with conda using
conda install tifffile -c conda-forge
I think the best way to implement this is to read the OME-XML metadata as a string, then parse this somehow (into JSON?) and save it as an HDF attribute associated with the image. I will likely need to create a new `datasetType` to do this, such as `widefieldImageMetadata`.
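The parsing step could be sketched with the standard library alone; recent tifffile versions can supply the OME-XML as a string (e.g. via `TiffFile(path).ome_metadata`), and the function below flattens that string into a dict ready for `json.dumps`. The function name and flattening scheme are assumptions, and repeated child tags would overwrite each other in this simple version:

```python
import json
import xml.etree.ElementTree as ET

# Sketch: flatten OME-XML element attributes into a nested dict that can
# be serialized with json.dumps and stored as an HDF attribute.
def omeXMLToDict(xmlString):
    root = ET.fromstring(xmlString)

    def strip(tag):
        return tag.split('}', 1)[-1]  # drop the XML namespace prefix

    def walk(elem):
        node = dict(elem.attrib)
        for child in elem:
            node[strip(child.tag)] = walk(child)  # NB: duplicates overwrite
        return node

    return {strip(root.tag): walk(root)}
```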
I probably shouldn't have removed outlier rejection in the fiducial bead detection processor, because it worked pretty darned well.
I have two possible options for doing this:
The `DatabaseAtom` code should be cleaned up by requiring only `acqID`, `data`, `prefix`, and `datasetType` as positional arguments; the rest can be made arguments that default to None.
This should help ease code maintenance and the addition of more fields in the future.
I'm working on this at the same time as #23, since the creation of a `SimpleParser` should reflect these changes. I'm also adding a `date` property as an optional field.
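The cleaned-up constructor might look like the following sketch; the four required fields come from the issue above, and the optional ones default to None (the class body is a stand-in, not B-Store's actual code):

```python
# Illustrative sketch of the cleaned-up constructor: four required
# positional arguments, everything else optional with None defaults.
# Field names come from the issue text; the class itself is a stand-in.
class DatabaseAtom:
    def __init__(self, acqID, data, prefix, datasetType,
                 channelID=None, posID=None, date=None):
        self.acqID = acqID
        self.data = data
        self.prefix = prefix
        self.datasetType = datasetType
        self.channelID = channelID
        self.posID = posID
        self.date = date
```

Adding a future field then means adding one more keyword argument with a None default, without touching existing call sites.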
Christian pointed out that the Merge processor and the MergeFang stats computer are missing inputs to modify the column names for x, y, and z. These should be added; otherwise, one has to rename the columns oneself.