kmdouglass / bstore

Lightweight data management and analysis tools for single-molecule fluorescence microscopy.

License: Other
A good first place for this is the FAQ.
The `query()` method will only return locResults when called, probably because this was all that was needed for batch processing. This functionality should be extended to all datasetTypes.

The best way to introduce this change will likely be to have the `query()` method return all dataset types by default, but accept optional arguments that restrict the types it returns when needed, such as in batch processing, which only needs localizations.

Once this is implemented, Tutorial 1 should be updated to highlight the utility of the `query()` method.
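As a minimal sketch (the function and argument names are illustrative assumptions, not B-Store's actual API), the default-plus-filter behavior could look like this:

```python
# Hypothetical sketch of a query() that returns every dataset type by
# default but accepts an optional restriction, e.g. for batch processing.
# Names are illustrative and not B-Store's actual API.
def query(datasets, datasetTypes=None):
    """Return datasets, optionally restricted to certain datasetTypes.

    datasets     : list of dicts, each with at least a 'datasetType' key
    datasetTypes : iterable of str, or None; None means return everything
    """
    if datasetTypes is None:
        return list(datasets)
    wanted = set(datasetTypes)
    return [d for d in datasets if d['datasetType'] in wanted]
```

Batch processing could then call `query(db, datasetTypes=['locResults'])` and keep its current behavior.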
Currently, `convert()` of the ConvertHeader class will also write out the column (with an empty header) containing the indexes of the underlying Pandas DataFrame. This should be removed or made optional, since the LEB specification does not list the index as a primary column.
This will help if a FiducialDriftCorrection processor is reused on multiple DataFrames.
The PEP 8 standard should be adopted to ensure that B-Store is easily extended.
Currently, there is no checking for empty dataset types (locResults, etc.) when the database is built. I need to handle the case when these might be empty, for example when no widefield images are present.
The `OverlayClusters` multiprocessor was somewhat hastily written for a project. Time should be taken to revise it and make sure that it is well implemented for release.
DataSTORM was always intended to be a temporary name. Since the project has solidified into a database management and analysis platform for SMLM data, and the version 0.1.0 release should happen this year, a final name should be decided upon.
Here are a few candidates that I like and that don't clash with anything on Google:
Bernie: a search for "Bernie database" returns the Bernie Sanders database breach. Unfortunately, Bernard is a microscopy equipment company somewhere in the US. There's also a Python package supporting Bernie Sanders called BernieScript.

I wonder whether it would be useful to implement a sort of "generic" datasetType that can store misc. images and metadata associated with a dataset?
One could imagine attaching blot images for acquisition group, for example.
Would it be sufficient to implement this datasetType as a numeric array (of arbitrary dimensions) and JSON attributes?
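A minimal sketch of such a generic type, assuming it pairs a NumPy array of arbitrary shape with JSON-serializable attributes (the class and method names are hypothetical, not existing B-Store code):

```python
import json
import numpy as np

# Illustrative sketch of a "generic" datasetType: an arbitrary-dimension
# numeric array plus JSON-serializable attributes. The class and field
# names are assumptions, not B-Store's actual API.
class GenericDataset:
    def __init__(self, data, attributes=None):
        self.data = np.asarray(data)        # numeric array, any shape
        self.attributes = attributes or {}  # misc. metadata, e.g. blot notes

    def attrsAsJSON(self):
        # The JSON string could be written as an HDF5 attribute
        return json.dumps(self.attributes, sort_keys=True)
```

The array could then be stored as an HDF5 dataset with the JSON string attached as an attribute.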
Related to #10 , the Merger could be made column agnostic. To do this, the set of operations on non-critical columns could be defined at runtime, rather than hard coded into the Merger class.
When an unnamed column corresponding to a Pandas index is present, batch processing will throw an error related to an unknown column name '0'. Specifically, this error often appears in the Merger processor.
One fix could be introduced into the mapping between columns: if no mapping exists for a specific column, then its name could be left unchanged.
It would be highly beneficial to retain the particle ID column when merging localizations so that merged localizations can be traced back to their original, unmerged versions.
Not a code issue but a design issue: how should the information obtained from a parser be used to create Datasets/DatabaseAtoms? The data in an atom can be in one of many formats, so it might not make sense to allow the parser to load the data; perhaps it should only return the name information.
If I do this, then what should load the data? Wrappers around Pandas' read_csv, etc.?
In the usual workflow, the localizations are drift-corrected and then added to the database. Localizations where no beads are present are labeled with a `DCX` suffix in their filename and are not added to the database.
This creates a situation where widefield images and metadata exist that could be added to the database, but the localization metadata won't be added because there are no localization results, and the widefield images will be abandoned without any other data attached to them.
What is the best way to handle this situation? Should there be a check, similar to what exists now for metadata, that doesn't put metadata or widefield images if the locResults are missing? Should the non-drift-corrected localization results be added after the drift-corrected locResults, along with their metadata and widefield images?
The Travis badge displays an error if the Anaconda servers are too busy, which generates the following error during the Travis build:
Error: HTTPError: 500 Server Error: Internal Server Error for url: https://repo.continuum.io/pkgs/free/linux-64/conda-env-2.5.1-py35_0.tar.bz2: https://repo.continuum.io/pkgs/free/linux-64/conda-env-2.5.1-py35_0.tar.bz2
Is there a way to ignore these kinds of errors, since they are caused not by B-Store but by Anaconda? See build 15 for the full report.
PyTables really doesn't like spaces in names, and this includes the `prefix` field.
I changed `prefix` to a property that automatically converts spaces to underscores, but there should be a unit test for this so changes in the future are caught. It just needs to verify that spaces get converted to underscores.
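The test could be as small as the following sketch; the `Atom` class here is a stand-in for wherever the `prefix` property actually lives in B-Store:

```python
import unittest

# Stand-in for the class that owns the prefix property; the real property
# lives in B-Store's code. Only the space-to-underscore behavior matters.
class Atom:
    @property
    def prefix(self):
        return self._prefix

    @prefix.setter
    def prefix(self, value):
        # PyTables dislikes spaces in names, so convert them
        self._prefix = value.replace(' ', '_')

class TestPrefix(unittest.TestCase):
    def test_spaces_become_underscores(self):
        atom = Atom()
        atom.prefix = 'HeLa Control'
        self.assertEqual(atom.prefix, 'HeLa_Control')
```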
I originally decided against an append method for Databases, but now I think that such a method makes sense.
Imagine taking replicates on different days, but wanting the replicates to be in the same database. It makes sense then, especially with the new dateID (see #26 and #24), to append data to the database.
One way to do this would be to call the `Database.build()` method. Should a copy of the HDF file be made before appending?
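If a safety copy is desirable, a small helper could back up the file before `build()` is called to append. `backupBeforeAppend` is a hypothetical name, not existing B-Store code:

```python
import shutil
from pathlib import Path

# Sketch: back up the HDF file before appending, so a failed append
# cannot corrupt the only copy. The append itself would reuse
# Database.build(); this helper only handles the backup step.
def backupBeforeAppend(dbFile):
    src = Path(dbFile)
    backup = src.with_suffix(src.suffix + '.bak')
    if src.exists():
        shutil.copy2(str(src), str(backup))
    return backup
```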
I noticed that Jupyter won't recognize Conda environments in its latest version. This is fixed if the `nb_conda` package is installed via Conda: https://docs.continuum.io/anaconda/jupyter-notebook-extensions#notebook-conda
This should be installed by default, since I use the notebooks so regularly.
@christian-7 and I agree that no one will use the software if it's not something they can use within ~5 minutes of installing.
One way to address this is to add a built-in, simple parser that simply converts filenames with a number at the end to a `DatabaseAtom` for insertion into the database. Essentially, it will only populate the required fields of `DatabaseAtom`, which are `prefix`, `acqID`, and `datasetType`.
Another advantage of doing this is that it can also serve as the tutorial for writing one's own parser.
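A sketch of what such a parser's core could look like, assuming filenames end in an acquisition number (the function name, regex, and return format are illustrative assumptions):

```python
import re

# Sketch of the proposed simple parser: a filename ending in a number
# maps to the three required fields. Field names follow the issue text;
# the regex and return format are assumptions.
def parseSimpleFilename(filename, datasetType='locResults'):
    """Split a filename like 'HeLa_Control_7.csv' into its ID fields."""
    stem = filename.rsplit('.', 1)[0]         # drop the extension
    match = re.match(r'(.*?)_?(\d+)$', stem)  # trailing digits are acqID
    if not match:
        raise ValueError('No trailing acquisition number: ' + filename)
    prefix, acqID = match.group(1), int(match.group(2))
    return {'prefix': prefix, 'acqID': acqID, 'datasetType': datasetType}
```

For example, 'HeLa_Control_7.csv' would yield prefix 'HeLa_Control' and acqID 7.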
To acknowledge:
Building the database with the `build()` method will halt when a metadata file is encountered but there is no localization results file to go with it. This can happen if one dataset could not be drift corrected while the others were.
Example error message:
```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/douglass/src/DataSTORM/DataSTORM/database.py in _putLocMetadata(self, atom)
447 attrVal = json.dumps(atom.data[currKey])
--> 448 hdf[dataset].attrs[attrKey] = attrVal
449 except KeyError:
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2579)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2538)()
/home/douglass/anaconda3/envs/DataSTORM/lib/python3.5/site-packages/h5py/_hl/group.py in __getitem__(self, name)
163 else:
--> 164 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
165
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2579)()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2538)()
h5py/h5o.pyx in h5py.h5o.open (/home/ilan/minonda/conda-bld/work/h5py/h5o.c:3551)()
KeyError: 'Unable to open object (Component not found)'

During handling of the above exception, another exception occurred:

LocResultsDoNotExist Traceback (most recent call last)
<ipython-input-4-cc56098e5c75> in <module>()
5
6 # Build the database
----> 7 db.build(parser, searchDirectory, locResultsString = 'locResults_DC.dat')
/home/douglass/src/DataSTORM/DataSTORM/database.py in build(self, parser, searchDirectory, dryRun, locResultsString, locMetadataString, widefieldImageString)
258 pp.pprint(parser.getBasicInfo())
259 if not dryRun:
--> 260 self.put(parser.getDatabaseAtom())
261
262 def _checkKeyExistence(self, atom):
/home/douglass/src/DataSTORM/DataSTORM/database.py in put(self, atom)
575 hdf.close()
576 elif atom.datasetType == 'locMetadata':
--> 577 self._putLocMetadata(atom)
578 elif atom.datasetType == 'widefieldImage':
579 # TODO: widefield images should also have SMLM ID's attached
/home/douglass/src/DataSTORM/DataSTORM/database.py in _putLocMetadata(self, atom)
449 except KeyError:
450 # Raised when the hdf5 key does not exist in the database.
--> 451 raise LocResultsDoNotExist(('Error: Cannot not append metadata. '
452 'No localization results exist with '
453 'these atomic IDs.'))
LocResultsDoNotExist: 'Error: Cannot not append metadata. No localization results exist with these atomic IDs.'
```
A simple solution would be to add a try...except block and handle these more gracefully, simply skipping the put() operation for this case.
I could also try to catch when this happens in the build dry run.
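The try...except approach can be sketched as follows, with stand-ins for B-Store's `LocResultsDoNotExist` exception and database object (the helper name `safePut` is an assumption):

```python
# Sketch of the proposed fix: wrap the put() call so that missing
# localization results skip the dataset instead of halting the build.
# LocResultsDoNotExist and the db/atom objects are stand-ins for
# B-Store's own classes.
class LocResultsDoNotExist(Exception):
    pass

def safePut(db, atom, skipped=None):
    """Try db.put(atom); record and skip atoms whose locResults are missing."""
    try:
        db.put(atom)
        return True
    except LocResultsDoNotExist:
        if skipped is not None:
            skipped.append(atom)  # report these to the user at the end
        return False
```

The build loop could collect the skipped atoms and print them in a summary, which would also help during a dry run.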
A notebook with a tutorial on merging should be created.
@christian-7 reported these errors when trying to build a database:
```python
# Import the libraries
from bstore import database, parsers
from pathlib import Path

# Specify the database file and create an HDF database
dbFile = 'database_test.h5'
db = database.HDFDatabase(dbFile)

# Define the parser that reads the files.
# Also specify the directory to search for raw files.
parser = parsers.MMParser()
searchDirectory = Path('Z:/Christian-Sieben/data_HTP/2016-06-03_humanCent_Sas6_A647')

# Build the database
db.build(parser, searchDirectory)
```

```
Unexpected error: <class 'ImportError'>
Unexpected error: <class 'UnboundLocalError'>
0 files were successfully parsed.
```
Obviously, it's not clear at all what caused the error. There are three try/except blocks inside database.py that could have caught these. The except blocks should be modified to better isolate the problem.
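One way to isolate the problem is to log the full traceback and the offending file instead of only the exception class. `tryParse` and `parseFilename` are hypothetical names used for illustration:

```python
import traceback

# Illustrative sketch: instead of a bare except that reports only
# <class 'ImportError'>, record the filename, the exception, and the
# full traceback so the user can see what actually failed.
def tryParse(parser, filename, failures):
    try:
        return parser.parseFilename(filename)
    except Exception as err:
        # Keep the full context rather than just the exception class
        failures.append((filename, err, traceback.format_exc()))
        return None
```

At the end of the build, the collected failures could be printed alongside the "0 files were successfully parsed" summary.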
Widefield images currently have no attributes containing the B-Store dataset IDs assigned to them inside the HDF database.
This is not a critical bug, since the IDs can be inferred from the HDF key, but they should be added nevertheless for consistency with locResults.
Perhaps this feature can be added when widefieldImage metadata is added.
I would like to use the HDF5 reader in FIJI to facilitate batch analyses on the widefield images.
To do this, the widefield images require an attribute called `element_size_um` that is a 3-element array of 32-bit floats specifying the pixel size. This should be written as an attribute of the widefield images. For example, if the pixel size were 108 nm, then `element_size_um = [1.0, 0.108, 0.108]`.
See http://lmb.informatik.uni-freiburg.de/resources/opensource/imagej_plugins/hdf5.html for more information.
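Building the attribute array is straightforward; writing it would then be a one-liner with h5py, e.g. `hdf[key].attrs['element_size_um'] = elementSize(0.108)`. The helper below is a sketch that assumes the first element is the z-step in micrometers (the function name is hypothetical):

```python
import numpy as np

# Sketch of the attribute FIJI's HDF5 reader expects: element_size_um is
# a 3-element float32 array (z, y, x) in micrometers. A 108 nm pixel
# gives [zStep, 0.108, 0.108].
def elementSize(pixelSizeUm, zStepUm=1.0):
    return np.array([zStepUm, pixelSizeUm, pixelSizeUm], dtype=np.float32)
```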
The README provides installation instructions for Linux and Windows. The Linux installation instructions need to be checked for correctness and the Windows instructions need to be written and checked.
The HDF database does a great job of packaging all the files from a single day. However, what happens when we have multiple HDF files from an entire project? These can become unruly to handle when they're spread across multiple directories, just like the datasets in a single experiment.
I can think of a few solutions to this problem. One solution is to create a class that stores and associates user notes to the HDF files so that their contents can be tracked and easily displayed. The downside to this is that the HDF files may move to different directories, so such a class could not be permanent.
Another solution is to create experimental attributes that are associated with each acquisition group. Then, a HDF file's attributes can be parsed to understand what type of information they contain.
How else can one manage a large number of related HDF files?
When the fiducial tracks become very large, DBSCAN starts to consume A LOT of memory (it consumed all 48 GB on my machine earlier). This is with a neighbor radius of 500 and a minimum number of samples of about 35,000. It's most prevalent for long, consistent tracks that are uninterrupted.
The development branch should be brought into CI.
This should include moving references in test_files to the new test_data repo.
@MStefko 's code may be found here: https://github.com/MStefko/anchor-for-STORM
After implementing generic types in #39, I realized that all dataset types can be decoupled from both the Database and Parser classes. This would make it much easier to extend B-Store to new dataset types. It also really doesn't make sense to treat locResults, locMetadata, and widefieldImages as types separate from generics.
Since two of the main purposes of B-Store are to be extensible and to be easily understood when the databases are examined by humans, I should probably refactor the code at this point to completely separate datasetTypes from Parser and Database.
pyhull should not be imported by default because it is not available for Windows.
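A common pattern for this is a guarded import at module load, so features that need the convex-hull code can check a flag before running (the flag and helper names are assumptions, not existing B-Store code):

```python
# Guarded import: pyhull is unavailable on Windows, so don't import it
# unconditionally at module load. Features that need it check HAS_PYHULL.
try:
    import pyhull  # noqa: F401
    HAS_PYHULL = True
except ImportError:
    HAS_PYHULL = False

def requirePyhull():
    """Raise a clear error when a pyhull-dependent feature is used."""
    if not HAS_PYHULL:
        raise ImportError('This feature requires pyhull, which is not '
                          'available on this platform.')
```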
To do: Add a license, probably BSD3.
The project could benefit from an automatic change log. For ideas, see http://keepachangelog.com/ and http://5by5.tv/changelog/127.
Rather than listing sequentially which datasets are added, the dry run could be more informative by listing all the datasetTypes belonging to an ID set.
It could also be made more useful by testing the placement of data without actually performing it, raising errors, for example, when metadata exists but no locResults do.
Some of the localization files we've generated could not fit in memory when using ThunderSTORM. Furthermore, they caused a significant slowdown on my computer when processing them in DataSTORM.
To overcome this limitation I can implement data chunking, whereby only part of the data is processed at a time.
The first step should therefore be to make the batch processor compatible with chunking, along with the Filter, ConvertHeader, and CleanUp processors.
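With Pandas this can be sketched using read_csv's chunksize argument; the processor is any callable mapping a DataFrame to a DataFrame (the helper name is an assumption):

```python
import pandas as pd

# Sketch of chunked processing: read the localization file in pieces and
# apply a processor to each piece, so the whole file never sits in
# memory at once. Works for row-wise processors like Filter or
# ConvertHeader; merging across chunk boundaries needs more care.
def processInChunks(csvPath, processor, chunkSize=100000):
    pieces = []
    for chunk in pd.read_csv(csvPath, chunksize=chunkSize):
        pieces.append(processor(chunk))
    return pd.concat(pieces, ignore_index=True)
```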
@christian-7 has pointed out multiple times that the drift correction module doesn't work if you select multiple beads but choose not to use spatial clustering (I think it's the `noClustering` argument). This happens because of the way the module sorts localizations in the user-selected ROI's. Rather than looking in each selected ROI independently, the module throws out localizations not in the selected ROI's and then attempts to spline fit whatever is left.
This means that if there is no spatial clustering performed on what's left, the module will attempt to fit a single spline to beads that may be located very far apart in the field of view. Turning on spatial clustering first clusters the localizations locally to find independent beads, and then fits a spline to the clusters.
One possible solution is to really separate ROI's within the module during the bead identification so that they are spline fit individually. This could also eliminate the need for spatial clustering, but it might be nice to keep it as an option because beads that are surrounded by dense areas of localizations may not work well for spline fitting.
This relates to #24.
The question is: where should the date ID go in the HDF key?
I am adding a `date` field to DatabaseAtoms/Datasets before the first release because I can imagine at least one important need for making this a primary dataset ID (albeit still optional, like channelID and posID).
This case would be when you have the same cell type measured on different days, but want to keep these measurements all in the same database with the same `prefix`. You might do this if you're performing replicate measurements.
In this case I think the prefix takes precedence, so I am leaning towards a key structure where the prefix comes first, then the optional date, then the prefix followed by the acquisition ID. For example:
Cos7/2016-06-16/Cos7_1/locResults_A647_Pos0
This seems strange to me, though, because my first instinct is to sort by date first:
2016-06-16/Cos7/Cos7_1/locResults_A647_Pos0
I imagine other people will also naturally sort first by date, so this is not a choice I can ignore.
Which of the two is best? Is there a third option? I don't think adding the date to the very end will work because there's no logical connection between datasets with the same `acqID`. If I add the date to the end:
Cos7/Cos7_1/locResults_A647_Pos0_2016-06-16
then it might imply that all `Cos7_1` datasets on different dates are somehow connected.
Remember that in the end, the hierarchy should make sense from both the computer's and the human's standpoint.
The README could use a short description of the problem it solves. Specifically, B-Store makes it easy to organize many different files from an SMLM dataset and then analyze them.
One possible idea for this section is a graphical description of a number of directories containing localization files, metadata, and other images spread across multiple directories. These are all sorted via an arrow into a single database structure that's clean and compact. From this structure, analyses on the data you want to look at are performed.
This could help better describe B-Store for users wanting to know exactly what it does.
OME-XML metadata may be read most easily with python-bioformats, but the Bioformats project is primarily GPL-licensed, which I would like to avoid.
tifffile, on the other hand, is BSD-licensed, actively developed, and appears to read TIFF metadata. It also installs with conda using
conda install tifffile -c conda-forge
I think the best way to implement this is to read the OME-XML metadata as a string, then parse this somehow (into JSON?) and save it as an HDF attribute associated with the image. I will likely need to create a new `datasetType` to do this, such as `widefieldImageMetadata`.
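The parsing step could be sketched with the standard library alone; recent tifffile versions can supply the OME-XML as a string (e.g. via `TiffFile(path).ome_metadata`), and the function below flattens that string into a dict ready for `json.dumps`. The function name and flattening scheme are assumptions, and repeated child tags would overwrite each other in this simple version:

```python
import json
import xml.etree.ElementTree as ET

# Sketch: flatten OME-XML element attributes into a nested dict that can
# be serialized with json.dumps and stored as an HDF attribute.
def omeXMLToDict(xmlString):
    root = ET.fromstring(xmlString)

    def strip(tag):
        return tag.split('}', 1)[-1]  # drop the XML namespace prefix

    def walk(elem):
        node = dict(elem.attrib)
        for child in elem:
            node[strip(child.tag)] = walk(child)  # NB: duplicates overwrite
        return node

    return {strip(root.tag): walk(root)}
```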
I probably shouldn't have removed outlier rejection in the fiducial bead detection processor, because it worked pretty darned well.
I have two possible options for doing this:
The `DatabaseAtom` code should be cleaned up by requiring only `acqID`, `data`, `prefix`, and `datasetType` as positional arguments; the rest can be made arguments that default to None.
This should help ease code maintenance and the addition of more fields in the future.
I'm working on this at the same time as #23, since the creation of a `SimpleParser` should reflect these changes. I'm also adding a `date` property as an optional field.
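The cleaned-up constructor might look like the following sketch; the four required fields come from the issue above, and the optional ones default to None (the class body is a stand-in, not B-Store's actual code):

```python
# Illustrative sketch of the cleaned-up constructor: four required
# positional arguments, everything else optional with None defaults.
# Field names come from the issue text; the class itself is a stand-in.
class DatabaseAtom:
    def __init__(self, acqID, data, prefix, datasetType,
                 channelID=None, posID=None, date=None):
        self.acqID = acqID
        self.data = data
        self.prefix = prefix
        self.datasetType = datasetType
        self.channelID = channelID
        self.posID = posID
        self.date = date
```

Adding a future field then means adding one more keyword argument with a None default, without touching existing call sites.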
Christian pointed out that the Merge processor and the MergeFang stats computer are missing inputs to modify the column names for x, y, and z. These should be added; otherwise, one has to rename the columns oneself.