cytomining / cytominer-transport

License: Other

Python 100.00%

cytominer-transport's Introduction

cytominer-transport

cytominer-transport's People

Contributors

gwaybio shntnu

cytominer-transport's Issues

Implementation discussion- parallelization

The following is off-topic (not usage-related) and is getting into implementation, so feel free to ignore for now, or bump to another thread.

It currently takes ~3-5 hours to ingest a file. Our current choice of SQLite does not allow us to do parallel writes, so there's no way to parallelize this.

But we can now, because we can store the output as a Parquet dataset, which can consist of multiple files.

So for a 384-well plate, we can save the output as a Parquet dataset with, say, 24 files (one for each column of the 384-well plate). This will also make parallel reads faster, so e.g. aggregation can be faster.
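The partitioning idea above can be sketched as follows. This is only an illustration, not code from cytominer-transport: group_wells_by_column, write_partitions and the placeholder writer are hypothetical names, and the output file naming is an assumption. A real writer would read the CSVs for its wells and write one Parquet file per plate column.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_wells_by_column(wells):
    """Map each plate column (e.g. "01") to the wells in that column."""
    groups = defaultdict(list)
    for well in wells:
        groups[well[1:]].append(well)  # "A01" -> column "01"
    return dict(groups)


def write_partitions(groups, write_partition, max_workers=4):
    """Call write_partition(column, wells) once per column, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            column: pool.submit(write_partition, column, column_wells)
            for column, column_wells in groups.items()
        }
        return {column: future.result() for column, future in futures.items()}


# A 384-well plate: rows A-P, columns 01-24.
wells = [f"{row}{col:02d}" for row in "ABCDEFGHIJKLMNOP" for col in range(1, 25)]
groups = group_wells_by_column(wells)

# Placeholder writer (assumption): returns the file name it would write
# instead of actually ingesting CSVs and writing Parquet.
written = write_partitions(
    groups, lambda column, column_wells: f"column={column}.parquet"
)
```

Threads are used here only to keep the sketch short. If CSV parsing turns out to be CPU-bound, a process pool may be the better fit.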

Originally posted by @shntnu in #1 (comment)

Option to output parent/child aligned single cell profiles instead of image/object number

Currently cytominer_transport/_generator.py combines objects based on "ImageNumber" and "ObjectNumber".

Instead, we should consider combining objects by their appropriate "Parent_{compartment}" and "Child_{compartment}".

Pros

The profilers are increasingly using single cells as the analytical unit, instead of aggregated profiles.
If cytominer_transport could output analysis-ready single-cell data, it would save profilers loads of time. It would basically obviate a pycytominer step of taking in Parquet files, rearranging rows based on parent/child columns, and then outputting a new file.

Cons

If some experiments have a high discrepancy between objects (e.g. 20,000 measurements in one object aligned to a total of 500 measurements in a separate object), the merge will produce a much larger file.
- Although I am not entirely sure this is not already happening with pandas.concat(axis=1):

cytominer-transport/src/cytominer_transport/_generator.py

Lines 64 to 66 in 056ced3

    
           concatenated_object_records = pandas.concat( 
        
               [concatenated_object_records, object_records], axis=1 
        
           )
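To make the difference concrete, here is a minimal sketch comparing the two alignments on toy data. The tables and the Parent_Cells relationship are hypothetical and chosen only so that one parent has two children. This is not the cytominer-transport implementation.

```python
import pandas

# Hypothetical toy tables; the values are made up for illustration.
cells = pandas.DataFrame({
    "ImageNumber": [1, 1],
    "ObjectNumber": [1, 2],
    "Cells_AreaShape_Area": [100.0, 150.0],
})
nuclei = pandas.DataFrame({
    "ImageNumber": [1, 1, 1],
    "ObjectNumber": [1, 2, 3],
    "Parent_Cells": [2, 1, 2],  # two child objects share parent cell 2
    "Nuclei_AreaShape_Area": [30.0, 40.0, 35.0],
})

# Alignment on (ImageNumber, ObjectNumber): rows are matched by object
# number, whether or not the objects are actually related.
index_columns = ["ImageNumber", "ObjectNumber"]
by_index = pandas.concat(
    [cells.set_index(index_columns), nuclei.set_index(index_columns)],
    axis=1,
)

# Alignment on the parent column: each child row receives the
# measurements of its parent cell, so parent rows can be duplicated.
by_parent = nuclei.merge(
    cells.rename(columns={"ObjectNumber": "Parent_Cells"}),
    on=["ImageNumber", "Parent_Cells"],
    how="left",
)
```

In this toy case both results have three rows, but by_index leaves the cell measurements empty for object 3, while by_parent repeats the measurements of cell 2 for both of its children.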

Notes

I think this functionality is independent of #7 since the fundamental analytical units differ.
However, if we do implement this parent/child merge before a potential aggregation (as described in #7), the results will differ, depending of course on how many times individual objects are duplicated by the merge (and on the variance of those objects).
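A small worked example of that effect, using made-up numbers: two cells with areas 100 and 150, where a parent/child merge duplicates the second cell because it has two children.

```python
from statistics import mean

# Hypothetical measurements: one area value per cell, keyed by cell number.
cell_area = {1: 100.0, 2: 150.0}

# Hypothetical parent assignment of three child objects: cell 2 appears twice.
child_parents = [2, 1, 2]

# Aggregating the cells directly weights each cell once.
mean_per_cell = mean(list(cell_area.values()))

# Aggregating after the merge weights each cell by its number of children.
mean_after_merge = mean([cell_area[parent] for parent in child_parents])
```

Here the per-cell mean is 125, while the mean after the merge is 400 / 3, roughly 133.3, because the larger cell is counted twice.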

`to_parquet` initial testing

If my decoder is decoding correctly (which I'm not certain it is), the number of "objects" is the same as the number of rows in the image CSV. That is, my Image.csv has two rows and my Cells.csv has 137, but I only see two rows in the Parquet file.
Some of the headers are wrong. The value for Area_Shape_Center_X from Cells.csv is 291.457696827262 (and that value is NOT present at all in Cytoplasm.csv), but when I search for that value in my Parquet file it is under AreaShape_Center_X_Cytoplasm_Image.

CSV_Input.zip

output.zip

Code run:

from cytominer_transport import to_parquet

example_source = "/Users/bcimini/Desktop/test/transport/per_well/20585_A02/"
example_objects = ["Cells.csv", "Cytoplasm.csv", "Nuclei.csv"]
example_destination = "test_dir"

to_parquet(source=example_source, destination=example_destination, objects=example_objects)

from cytominer_transport import to_parquet

example_source = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/analysis/"
example_objects = ["cells.csv", "cytoplasm.csv", "nuclei.csv"]
example_destination = "test_dir"

to_parquet(source=example_source, destination=example_destination, objects=example_objects)

and received the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a195cc2e3a3b> in <module>
----> 1 to_parquet(source=example_source, destination=example_destination, objects=example_objects)

~/miniconda3/lib/python3.7/site-packages/cytominer_transport-0.1.0-py3.7.egg/cytominer_transport/_to_parquet.py in to_parquet(source, destination, experiment, image, objects, compression, **kwargs)
     69         image.set_index("ImageNumber")
     70     else:
---> 71         raise FileNotFoundError(filename=pathname)
     72 
     73     # Open object CSVs (e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv, etc.)

TypeError: FileNotFoundError() takes no keyword arguments

I then updated the following variables:

# Just one well now
example_source = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/analysis/A01-1/"
# Capitalize to match file system
example_objects = ["Cells.csv", "Cytoplasm.csv", "Nuclei.csv"]

And I received the same error
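The TypeError in the traceback comes from the raise statement itself: Python's built-in FileNotFoundError does not accept a filename keyword, so the original "file not found" condition is masked. A minimal sketch of a raise that works, using the positional (errno, strerror, filename) form, is shown below. require_file is a hypothetical helper rather than a function in cytominer_transport, and os.path.isfile only checks local paths, so an s3:// source would need a different existence check.

```python
import errno
import os


def require_file(pathname):
    """Return pathname if it is an existing local file, else raise."""
    if not os.path.isfile(pathname):
        # OSError subclasses take positional arguments:
        # (errno, strerror, filename). Keyword arguments are rejected.
        raise FileNotFoundError(
            errno.ENOENT, os.strerror(errno.ENOENT), pathname
        )
    return pathname


# Demonstration with a path that is assumed not to exist.
try:
    require_file("/no/such/directory/Image.csv")
except FileNotFoundError as error:
    missing = error.filename
```

With this form, the exception message names the missing file, which makes it easier to tell whether the source path or the file name capitalization is the problem.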

Usage decisions

How should users interact with the codebase? Let's track our thoughts and decide here.
