cell-feature-data's People

Contributors

aditnath, meganrm, rugeli, schoinh, tanyasg, toloudis, vianamp


cell-feature-data's Issues

create script to generate CSVs

Each dataset should also generate a CSV file with the feature data, containing the following per row:
cell ID, FOV ID (needed?), each feature value, a downloadable file path to the cell OME-TIFF, and a downloadable file path to the FOV OME-TIFF (needed?)

Column names for features should be the feature names (or keys?) from the feature defs JSON file.
Also, as a proposal:

cell_id
fov_id
cell_download_url
fov_download_url
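
A hedged sketch of the proposed export, assuming pandas and the column names proposed above; the helper name and input structures are illustrative, not existing repo code:

    import pandas as pd

    def write_dataset_csv(cells, feature_names, out_path):
        """Sketch: one row per cell, with the proposed columns plus one column per feature."""
        rows = []
        for cell in cells:
            row = {
                "cell_id": cell["cell_id"],
                "fov_id": cell.get("fov_id"),
                "cell_download_url": cell.get("cell_download_url"),
                "fov_download_url": cell.get("fov_download_url"),
            }
            # feature columns named after the feature defs, in order
            row.update(dict(zip(feature_names, cell["features"])))
            rows.append(row)
        pd.DataFrame(rows).to_csv(out_path, index=False)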

Interactive user prompts for additional settings

Use Case

Please provide a use case to help us understand your request in context

After the initial dataset files are created, we want to gather more info from users to further complete the dataset data

Acceptance Criteria

Please describe how you know this is done

A log message prompts the user to run a Python command, `enter-required-setting`, that triggers interactive prompts allowing the user to enter additional settings.

Details

Please provide any helpful specifications

The list of required data (this list can be extended as needed):

  1. `xAxis default` (items 1-4 offer the user a selection from the features in `feature_defs`)
  2. `yAxis default`
  3. `colorBy`
  4. `groupBy`
  5. `thumbnailRoot`
  6. `volumeViewerDataRoot`
  7. `downloadRoot`
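
A minimal sketch of what the `enter-required-setting` prompts could look like, assuming a plain input()-based flow; the prompt wording and return shape are illustrative, not the actual implementation:

    REQUIRED_SETTINGS = [
        "xAxis default", "yAxis default", "colorBy", "groupBy",
        "thumbnailRoot", "volumeViewerDataRoot", "downloadRoot",
    ]
    FEATURE_SELECTIONS = {"xAxis default", "yAxis default", "colorBy", "groupBy"}

    def prompt_for_settings(feature_names):
        settings = {}
        for name in REQUIRED_SETTINGS:
            if name in FEATURE_SELECTIONS:
                # items 1-4 offer a selection from the features in feature_defs
                print(f"Choose {name} from: {', '.join(feature_names)}")
            settings[name] = input(f"{name}: ")
        return settings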
    

add a github action to publish datasets (process_data) to production

Use Case

Please provide a use case to help us understand your request in context

We want data creators to be able to test their data in the production database as well, before releasing it.

Acceptance Criteria

Please describe how you know this is done

An authorized user should be able to run process-dataset on main to publish their dataset to the production database. The job will be skipped on other branches.

Details

Please provide any helpful specifications

Plan with manually triggered github action:

  1. Run process_dataset to publish a new dataset to staging db (github action no.1) --DONE
  2. Test and check data at staging frontend
  3. Run process_dataset to publish to production db (github action no.2) --TODO, this is the new issue #64
  4. Test and check data at both the staging & production frontends
  5. After testing, release dataset to production database (github action no.3) --ALMOST DONE

question about other required fields

Are these fields all required? (Do the scripts fill them in if not provided in json?)

data.xAxis.default
data.yAxis.default
data.colorBy.default
data.groupBy.default
data.thumbnailRoot
data.downloadRoot
data.volumeViewerDataRoot

I am pretty sure that CFE can handle the last three being empty or undefined; I am more concerned with the first four.

Use Python CLI with Google's Fire

Use Case

Please provide a use case to help us understand your request in context

As of now, our dataset processing Python script is executed via a Node command, `npm run create-dataset -- [PATH-TO-CSV-FILE]`. It would be good to run it via Python instead.

Acceptance Criteria

Please describe how you know this is done

The dataset processor should be executed using a Python command instead of Node.
The CLI should follow a format like `create-dataset -d [PATH-TO-CSV-FILE]`.

Details

Please provide any helpful specifications

  • create a file in dataset-processor-python, at the same level as data_loader.py
  • integrate Google's Fire library to create the new CLI for the Python dataset processor
  • the new CLI should accept arguments and options (e.g. -d for the dataset CSV file path); see the sketch below
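
A minimal sketch of the Fire-based entry point, assuming it lives next to data_loader.py; the function body is a placeholder:

    import fire

    def create_dataset(d):
        """Process the dataset CSV at the given path (passed via --d)."""
        print(f"processing {d}")

    if __name__ == "__main__":
        fire.Fire(create_dataset)

Fire exposes the parameter name as a flag, so the command shape matches the acceptance criteria; whether the single-dash -d spelling works out of the box should be verified against Fire's flag parsing.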

parallel processing is using the same temp dir per dataset

When processing a megaset, files overwrite each other in the temp dir. This causes all kinds of crashes that prevent processing from continuing or, in the worst case, sends up inconsistent or bad data.

I have put a fix in the variance dataset branch.
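
For reference, the shape of the fix is to give each dataset its own temp dir rather than a shared one; a hedged sketch, not the actual fix on the branch:

    import tempfile

    def process_dataset(dataset_name):
        # a unique, auto-cleaned temp dir per dataset, so parallel runs cannot collide
        with tempfile.TemporaryDirectory(prefix=f"{dataset_name}-") as tmp_dir:
            ...  # write this dataset's intermediate files under tmp_dir only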

xAxis and yAxis could be made optional

Use Case

Currently xAxis and yAxis are required.
I was wondering if they could be made optional if the scripts that live here auto-select the first couple of features to populate them.

Acceptance Criteria

Evaluate whether this would make data entry easier or whether we should continue to impose the requirement to make data entry more deliberate / intentional.
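
A hedged sketch of the auto-select idea, assuming feature defs carry a "key" field:

    def default_axes(feature_defs):
        # fall back to the first two features when xAxis/yAxis are omitted
        keys = [f["key"] for f in feature_defs]
        return keys[0], keys[1] if len(keys) > 1 else keys[0]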

Initialize CLI and set up new python directory

Use Case

Please provide a use case to help us understand your request in context
Establish the foundational components: setting up the CLI and directory, reading the input file, providing initial prompts, etc.

Acceptance Criteria

Please describe how you know this is done
Should be able to run the command `create_dataset` to read the input file.
Users can receive both required and optional prompts.

Details

Please provide any helpful specifications

add download links to datasets

add an optional attribute to a dataset description doc, i.e. dataset.json (not a megaset), called downloadLinks, that is an array of objects. Each object has required attributes:
{
    date: string,
    title: string,
    link: string
}

Update to dataset processing:

On upload, this data should be included in the dataset description information. The schema setup should do this automatically.

For updating the schema:

  1. add a new file in array-items called "dataset-link.schema.json" that has the definition of one single dataset link, i.e.
        title: string,
        link: string
  2. reference that individual schema in definitions.schema.json, where
        datasetLinks:
            type: array
            items:
                $ref: dataset-link.schema.json
                type: object
  3. update input-dataset-info.schema.json and dataset-card.schema.json to have datasetLinks reference the object def in definitions.schema.json
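
A hedged sketch of what validating one link against the new schema might look like, using the jsonschema package; the inline schema below is an illustration, not the contents of the actual dataset-link.schema.json:

    from jsonschema import validate

    dataset_link_schema = {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "title": {"type": "string"},
            "link": {"type": "string"},
        },
        "required": ["date", "title", "link"],
    }

    validate(
        instance={
            "date": "May/26/2021",
            "title": "The hiPSC single-cell image dataset",
            "link": "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_cell_image_dataset/tree/latest/metadata.csv",
        },
        schema=dataset_link_schema,
    )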

Amendments to the datasets themselves:

Add this data to the existing datasets (@meganrm will update this with which link goes to which dataset):
[
    {
        date: "May/26/2021",
        title: "The hiPSC single-cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/21/2022",
        title: "The hiPSC single i1 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_i1_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/20/2022",
        title: "The hiPSC single M1 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_m1_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/21/2022",
        title: "The hiPSC single i2 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_i2_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/20/2022",
        title: "The hiPSC single M2 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_m2_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/19/2022",
        title: "The hiPSC single edge cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_edge_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Sep/28/2022",
        title: "The hiPSC single non-edge cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_nonedge_cell_image_dataset/tree/latest/metadata.csv",
    },
]

Re-publish docs with new schema:

npm run publish-docs

add transform data

We need to add per-image data containing a transform object.
Propose that the data be considered as a new top-level data item in the features file alongside file_info and features.

file_info: [...],
features: [...],
transform: {
    translate: [number, number, number] = [0, 0, 0],
    rotate: [number, number, number] = [0, 0, 0]
}

transform needs to be completely optional, allowed to be missing from any image.
The data needs to be queryable per-selected-cell, and it would be ok (preferred?) to include it in the file_info request since that is already fully implemented in CFE.
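
A hedged sketch of how a consumer could read the optional transform, applying the defaults shown above when it is absent:

    def get_transform(file_info_entry):
        # transform is completely optional and may be missing from any image
        t = file_info_entry.get("transform") or {}
        return {
            "translate": t.get("translate", [0, 0, 0]),
            "rotate": t.get("rotate", [0, 0, 0]),
        }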

Alternative:

include this translate and rotate data as actual features, make them always hidden from the plot, and add dataset info to indicate the feature key names that contain the transform data (`translatexfeature: "translate_x"`, etc.). Then CFE can use the dataset info feature keys to pull out the transform data, if it exists, to pass into the 3D viewer.

validate totalCells and totalFOVs

Use Case

A dataset must have a non-zero value for either totalCells or totalFOVs or both, in userData.

Acceptance Criteria

Add a validation function to check the values in userData/totalCells and userData/totalFOVs, and error if both values are zero.
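
A minimal sketch of the check, assuming userData is available as a dict:

    def validate_totals(user_data):
        total_cells = user_data.get("totalCells", 0)
        total_fovs = user_data.get("totalFOVs", 0)
        # at least one of the two must be non-zero
        if not total_cells and not total_fovs:
            raise ValueError("userData must have a non-zero totalCells or totalFOVs")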

idea: Python create_dataset script

Use Case

More tooling to help scientists set up data for CFE. The goal is to make it simple: take in a spreadsheet of features, output a basic dataset. It's OK if it's not completely fleshed out, but it should validate.

Acceptance Criteria

Python script that can be run from command line or invoked from other python code.
It should:

  • set up a dataset directory and create all necessary json files
  • accept a spreadsheet of features (spec TBD, but it probably requires an "id" column, and the rest can be feature names)
  • automatically identify string-valued features and turn them into discrete features with options (see the sketch below)
  • fill in feature defs (and all other json files) as best it can
  • do some basic validation; tell the user if the spreadsheet results in bad data
  • accept a csv file or pandas dataframe
  • leave image processing to other code; optionally fill in image names as placeholders
  • bonus: provide an "interactive" mode that prompts for some data entry on the command line (e.g. feature descriptions, units, some of the dataset.json fields, etc.)

Details

Must be Python so other scientists can import it
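
A hedged sketch of the discrete-feature detection mentioned above; the feature-def field names here are assumptions, not the repo's actual schema:

    import pandas as pd

    def infer_feature_defs(df, id_column="id"):
        defs = []
        for col in df.columns:
            if col == id_column:
                continue
            if df[col].dtype == object:
                # string-valued column -> discrete feature with one option per unique value
                options = {str(i): {"name": v} for i, v in enumerate(df[col].dropna().unique())}
                defs.append({"key": col, "discrete": True, "options": options})
            else:
                defs.append({"key": col, "discrete": False})
        return defs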

improve ux for data creators

Create a single validation script that a data creator can run to ensure integrity before attempting to upload a dataset:
npm run validate-dataset [PATH_TO_DATASET]

A second improvement for data creators would be to publish the docs online (maybe we already do this, but I didn't check; improving documentation in general is the gist here).

code reorg: separate data

Put all data subdirs into a data/ directory, and all source code into a src/ to make it very obvious which parts of this repo are code vs data.

Create a folder within data

Use Case

Please provide a use case to help us understand your request in context
Create a folder within data and name it [name-version] with the file structure:

|-- /[dataset_name_version]/
|   |-- dataset.json
|   |-- feature_defs.json
|   |-- cell_feature_analysis.json
|   |-- image_settings.json
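
A minimal sketch of scaffolding that structure; the files are created empty as placeholders:

    from pathlib import Path

    def scaffold_dataset(name_version, root="data"):
        dataset_dir = Path(root) / name_version
        dataset_dir.mkdir(parents=True, exist_ok=True)
        for filename in ("dataset.json", "feature_defs.json",
                         "cell_feature_analysis.json", "image_settings.json"):
            (dataset_dir / filename).touch()  # placeholder to fill in later
        return dataset_dir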

Acceptance Criteria

Please describe how you know this is done
The dataset folder should exist within /data and contain the specified files, ready for segregation and storage.

Details

Please provide any helpful specifications

audit the "required" settings in the schema for things that are no longer required

What needs to happen?

A few things in the schema are marked as required that aren't; they should be changed:

dataset.json: image albumPath? (need to check the front end on this one); viewerSettingPath? seems like it should be optional since the settings themselves are optional; dataDownloadRoot

feature_defs.json: tooltip (need to check the front end); description could be optional

We should pair-program on this to check the front-end settings.

validate from command line for single datasets

Now that we can validate all datasets via GitHub Actions, we should add a script to validate a single dataset that can be run from the CLI. The old validate.js should then be removed or replaced by it.

"key" is required if "name" is not unique in options

Use Case

Please provide a use case to help us understand your request in context
Based on discrete-feature-option.schema.json, the key property is required when the name is not unique within options. Our current ajv validator passes datasets that lack a key in options even when the name is not unique.

For instance:

// we want validation to fail if `key` is missing in this case
"options": {
    "5": {
        "color": "#77207C",
        "name": "Matrix adhesions"
    },
    "6": {
        "color": "#77207C",
        "name": "Matrix adhesions"
    }
}

// we want these to pass
"options": {
    "5": {
        "color": "#77207C",
        "name": "Matrix adhesions",
        "key": "Paxillin"
    },
    "6": {
        "color": "#77207C",
        "name": "Matrix adhesions",
        "key": "Some Key"
    }
}


Question:

  • How do we determine whether a name is unique or not? Is there a predefined list of common/unique names? (solved)

Acceptance Criteria

Please describe how you know this is done

  1. check if the name is unique
  2. apply conditions to the subschemas in discrete-feature-option.schema: if the name is not unique, key should be required; if the name is unique, key is optional (see the sketch below)
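
A hedged sketch of the uniqueness check itself; the error wording and option shape are illustrative:

    from collections import Counter

    def validate_option_keys(options):
        name_counts = Counter(opt["name"] for opt in options.values())
        errors = []
        for idx, opt in options.items():
            # a duplicated name must carry a disambiguating key
            if name_counts[opt["name"]] > 1 and "key" not in opt:
                errors.append(f"option {idx!r}: duplicate name {opt['name']!r} requires a key")
        return errors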

dataset feature verification log for data creators

Use Case

Please provide a use case to help us understand your request in context

It would be good if data creators were able to verify the details of the datasets they create, i.e. verify that the feature names in featuresDataOrder match the order of their values in the cell-feature-analysis file.

Acceptance Criteria

Please describe how you know this is done

  • output a map of ordered feature names and values in the terminal when running a single dataset validation (see the sketch below)
  • prompt the user to check the map and update the dataset if there are discrepancies
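
A minimal sketch of that output, assuming featuresDataOrder and one sample row of values are already loaded:

    def print_feature_map(features_data_order, sample_row):
        # pair each feature name with the value at the same position
        for name, value in zip(features_data_order, sample_row):
            print(f"{name}: {value}")
        print("Check that each value matches its feature name; update the dataset if anything is off.")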

Add firebase secret tokens to github and do a little code cleanup

I'm not sure how Firebase credentials are doled out in general, but any user who needs to be able to publish datasets will be blocked without proper authentication.

Perhaps if we had GitHub Actions that could publish data, we could store the creds in a single place here in this repo.

display order for megaset cards

The datasets in a megaset are specified as an ordered list in the input json, but they get turned into a keyed object whose key order is indeterminate (and at runtime this results in random ordering of the cards in CFE).

Marking this as a bug because it's unexpected from the point of view of the data entry person.
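
One possible fix, sketched under the assumption that each dataset carries a usable name field, is to stamp the input order onto the keyed object so CFE can sort the cards deterministically:

    def key_datasets(dataset_list):
        # preserve the input list order with an explicit index
        return {d["name"]: {**d, "order": i} for i, d in enumerate(dataset_list)}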
