cytominer-transport's Introduction

cytominer-transport

cytominer-transport's People

Contributors

0x00b1, bethac07

Forkers

gwaybio shntnu

cytominer-transport's Issues

Implementation discussion- parallelization

The following is off-topic (not usage-related) and is getting into implementation, so feel free to ignore for now, or bump to another thread.

It currently takes ~3-5 hours to ingest a file. Our current choice of SQLite does not allow us to do parallel writes, so there's no way to parallelize this.

But we can now, because we can store the output as a Parquet dataset, which can consist of multiple files.

So for a 384-well dataset, we can save the output as a Parquet dataset with, say, 24 files (one for each column of the 384-well plate). This will also make parallel reads faster, so e.g. aggregation can be faster.

Originally posted by @shntnu in #1 (comment)
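
As a rough sketch of the idea above, the wells of a 384-well plate can be grouped by plate column and each group handed to its own worker, so that each worker writes one file of the dataset. Everything named here is hypothetical: `group_wells_by_column`, `ingest_in_parallel`, and the `write_partition` callable are not part of cytominer-transport, and well names are assumed to be a row letter followed by a two-digit column (e.g. `A01`).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# All 384 well names: rows A-P (16) by columns 01-24.
WELLS_384 = [
    f"{row}{column:02d}" for row in "ABCDEFGHIJKLMNOP" for column in range(1, 25)
]


def group_wells_by_column(wells):
    """Map each plate column number to the wells in that column."""
    groups = defaultdict(list)
    for well in wells:
        groups[int(well[1:])].append(well)  # assumes names like "A01"
    return dict(groups)


def ingest_in_parallel(wells, write_partition, max_workers=4):
    """Call write_partition(column, wells_in_column) once per plate column.

    write_partition is a hypothetical callable that would read the CSVs for
    the given wells and write a single file of the Parquet dataset.
    """
    groups = group_wells_by_column(wells)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            column: pool.submit(write_partition, column, members)
            for column, members in groups.items()
        }
        # result() re-raises any exception raised inside a worker.
        return {column: future.result() for column, future in futures.items()}
```

Threads are used here only to keep the sketch short. If CSV parsing turns out to be the bottleneck, a process pool or a task scheduler may be a better fit.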

Option to output parent/child aligned single cell profiles instead of image/object number

Currently cytominer_transport/_generator.py combines objects based on "ImageNumber" and "ObjectNumber".

Instead, we should consider combining objects by their appropriate "Parent_{compartment}" and "Child_{compartment}".

Pros

  • The profilers are increasingly using single cells as the analytical unit, instead of aggregated profiles
  • If cytominer_transport could output analysis-ready single cell data, it would save profilers loads of time (it would basically obviate a pycytominer step of taking in parquet files, rearranging rows based on parent/child columns, and then outputting a new file)

Cons

  • If some experiments have a large discrepancy between object counts (e.g. 20,000 measurements in one object aligned to a total of 500 measurements in a separate object), the merge will produce a much larger file
    • Although I am not entirely sure this is not already happening with pandas.concat(axis=1)

concatenated_object_records = pandas.concat(
    [concatenated_object_records, object_records], axis=1
)

Notes

  • I think this functionality is independent of #7 since the fundamental analytical units differ.
  • However, if we do implement this parent/child merge before a potential aggregation (as described in #7), the results will differ, depending of course on how often individual objects are duplicated by the merge and on how variable that duplication is
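
To make the difference concrete, here is a minimal sketch on invented toy data. It assumes both frames are indexed by "ImageNumber" and "ObjectNumber" before `pandas.concat(axis=1)` is called, which may not match what `_generator.py` actually does, and the measurement columns and values are made up for illustration. With index-aligned concatenation, unmatched rows are padded with NaN. With a merge on the parent column, a parent row is repeated once for each of its children.

```python
import pandas as pd

# Toy parent objects: two cells in one image.
cells = pd.DataFrame({
    "ImageNumber": [1, 1],
    "ObjectNumber": [1, 2],
    "Cells_AreaShape_Area": [100.0, 150.0],
})

# Toy child objects: three nuclei, two of which share parent cell 2.
nuclei = pd.DataFrame({
    "ImageNumber": [1, 1, 1],
    "ObjectNumber": [1, 2, 3],
    "Parent_Cells": [1, 2, 2],
    "Nuclei_AreaShape_Area": [30.0, 40.0, 45.0],
})

keys = ["ImageNumber", "ObjectNumber"]

# Alignment by image/object number: rows are matched on the index, and the
# nucleus with no same-numbered cell gets NaN in the cell columns.
by_position = pd.concat(
    [cells.set_index(keys), nuclei.set_index(keys)], axis=1
)

# Alignment by parent: each nucleus is joined to the cell named in
# Parent_Cells, so cell 2 appears twice in the result.
by_parent = nuclei.merge(
    cells,
    left_on=["ImageNumber", "Parent_Cells"],
    right_on=["ImageNumber", "ObjectNumber"],
    suffixes=("_Nuclei", "_Cells"),
)
```

In this toy case both results have three rows, but only the parent/child merge repeats the measurements of cell 2, which is the duplication the notes above refer to.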

`to_parquet` initial testing

  1. If my decoder is decoding correctly (which I'm not certain it is), the number of "objects" is the same as the number of rows in the image CSV, i.e. my Image.csv has two rows and my Cells.csv has 137, but I only see two lines in the file.
  2. Some of the headers are wrong. The value for Area_Shape_Center_X from Cells.csv is 291.457696827262 (that value is NOT present at all in Cytoplasm.csv), but when I search for that value in my parquet file it is under AreaShape_Center_X_Cytoplasm_Image.

CSV_Input.zip

output.zip

Code run:

from cytominer_transport import to_parquet

example_source = "/Users/bcimini/Desktop/test/transport/per_well/20585_A02/"
example_objects = ["Cells.csv", "Cytoplasm.csv", "Nuclei.csv"]
example_destination = "test_dir"

to_parquet(source=example_source, destination=example_destination, objects=example_objects)

Initial testing

I am trying out the first pass of cytominer-transport. After installing with python setup.py install, I tried the following:

from cytominer_transport import to_parquet

example_source = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/analysis/"
example_objects = ["cells.csv", "cytoplasm.csv", "nuclei.csv"]
example_destination = "test_dir"

to_parquet(source=example_source, destination=example_destination, objects=example_objects)

and received the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a195cc2e3a3b> in <module>
----> 1 to_parquet(source=example_source, destination=example_destination, objects=example_objects)

~/miniconda3/lib/python3.7/site-packages/cytominer_transport-0.1.0-py3.7.egg/cytominer_transport/_to_parquet.py in to_parquet(source, destination, experiment, image, objects, compression, **kwargs)
     69         image.set_index("ImageNumber")
     70     else:
---> 71         raise FileNotFoundError(filename=pathname)
     72 
     73     # Open object CSVs (e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv, etc.)

TypeError: FileNotFoundError() takes no keyword arguments

I then updated the following variables:

# Just one well now
example_source = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/analysis/A01-1/"
# Capitalize to match file system
example_objects = ["Cells.csv", "Cytoplasm.csv", "Nuclei.csv"]

And I received the same error.
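
The `TypeError` in the traceback hides the real problem. `FileNotFoundError` is a subclass of `OSError`, whose constructor accepts only positional arguments, so `FileNotFoundError(filename=pathname)` fails before the intended exception can be raised. A minimal sketch of one way to raise it with the filename attached follows. The helper names `missing_file_error` and `require_file` are hypothetical, and `os.path.isfile` only checks local paths, so it would not work for the `s3://` sources used above.

```python
import errno
import os


def missing_file_error(pathname):
    """Build a FileNotFoundError whose .filename attribute is set.

    OSError subclasses take (errno, strerror, filename) positionally.
    """
    return FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), pathname)


def require_file(pathname):
    """Return pathname if it is an existing local file, else raise."""
    if not os.path.isfile(pathname):
        raise missing_file_error(pathname)
    return pathname
```

With this form the user sees the missing path in the error message instead of a complaint about keyword arguments, which would have shown right away which file could not be found.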

Usage decisions

How should users interact with the codebase? Let's track our thoughts and decide here.
