cell-feature-data's People

Contributors

aditnath, meganrm, rugeli, schoinh, tanyasg, toloudis, vianamp


cell-feature-data's Issues

create script to generate CSVs

Each dataset should also generate a CSV file with the feature data, containing the following per row:
cell ID, FOV ID (needed?), each feature value, a downloadable file path to the cell OME-TIFF, and a downloadable file path to the FOV OME-TIFF (needed?)

Column names for features should be the feature names (or keys?) from the feature defs JSON file.
Also, as a proposal:

cell_id
fov_id
cell_download_url
fov_download_url
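
A hedged sketch of the proposed export, assuming pandas and the column names proposed above; the helper name and input structures are illustrative, not existing repo code:

    import pandas as pd

    def write_dataset_csv(cells, feature_names, out_path):
        """Sketch: one row per cell, with the proposed columns plus one column per feature."""
        rows = []
        for cell in cells:
            row = {
                "cell_id": cell["cell_id"],
                "fov_id": cell.get("fov_id"),
                "cell_download_url": cell.get("cell_download_url"),
                "fov_download_url": cell.get("fov_download_url"),
            }
            # feature columns named after the feature defs, in order
            row.update(dict(zip(feature_names, cell["features"])))
            rows.append(row)
        pd.DataFrame(rows).to_csv(out_path, index=False)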

Interactive user prompts for additional settings

Use Case

Please provide a use case to help us understand your request in context

After the initial dataset files are created, we want to gather more info from users to further complete the dataset data

Acceptance Criteria

Please describe how you know this is done

A log message prompts the user to run a Python command, `enter-required-setting`, that triggers interactive prompts allowing the user to enter additional settings.

Details

Please provide any helpful specifications

The list of required data (this list can be extended as needed):

  1. `xAxis default` (items 1-4 offer the user a selection from the features in `feature_defs`)
  2. `yAxis default`
  3. `colorBy`
  4. `groupBy`
  5. `thumbnailRoot`
  6. `volumeViewerDataRoot`
  7. `downloadRoot`
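
A minimal sketch of what the `enter-required-setting` prompts could look like, assuming a plain input()-based flow; the prompt wording and return shape are illustrative, not the actual implementation:

    REQUIRED_SETTINGS = [
        "xAxis default", "yAxis default", "colorBy", "groupBy",
        "thumbnailRoot", "volumeViewerDataRoot", "downloadRoot",
    ]
    FEATURE_SELECTIONS = {"xAxis default", "yAxis default", "colorBy", "groupBy"}

    def prompt_for_settings(feature_names):
        settings = {}
        for name in REQUIRED_SETTINGS:
            if name in FEATURE_SELECTIONS:
                # items 1-4 offer a selection from the features in feature_defs
                print(f"Choose {name} from: {', '.join(feature_names)}")
            settings[name] = input(f"{name}: ")
        return settings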
    

add a github action to publish datasets (process_data) to production

Use Case

Please provide a use case to help us understand your request in context

We want data creators to be able to test their data in the production database as well, before releasing it.

Acceptance Criteria

Please describe how you know this is done

An authorized user should be able to run process-dataset on main to publish their dataset to the production database. The job will be skipped on other branches.

Details

Please provide any helpful specifications

Plan with manually triggered github action:

  1. Run process_dataset to publish a new dataset to staging db (github action no.1) --DONE
  2. Test and check data at staging frontend
  3. Run process_dataset to publish to production db (github action no.2) --TODO, this is the new issue #64
  4. Test and check data at both the staging & production frontends
  5. After testing, release dataset to production database (github action no.3) --ALMOST DONE

question about other required fields

Are these fields all required? (Do the scripts fill them in if not provided in json?)

data.xAxis.default
data.yAxis.default
data.colorBy.default
data.groupBy.default
data.thumbnailRoot
data.downloadRoot
data.volumeViewerDataRoot

I am pretty sure that CFE can handle the last three being empty or undefined; I am more concerned with the first four.

Use Python CLI with Google's Fire

Use Case

Please provide a use case to help us understand your request in context

As of now, our dataset processing Python script is executed via a Node command, `npm run create-dataset -- [PATH-TO-CSV-FILE]`. It would be good to run it via Python instead.

Acceptance Criteria

Please describe how you know this is done

The dataset processor should be executed using a Python command instead of Node.
The CLI should follow a format like `create-dataset -d [PATH-TO-CSV-FILE]`.

Details

Please provide any helpful specifications

  • create a file in dataset-processor-python, at the same level as data_loader.py
  • integrate Google's Fire library to create the new CLI for the Python dataset processor
  • the new CLI should accept arguments and options (e.g. -d for the dataset CSV file path); see the sketch below
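
A minimal sketch of the Fire-based entry point, assuming it lives next to data_loader.py; the function body is a placeholder:

    import fire

    def create_dataset(d):
        """Process the dataset CSV at the given path (passed via --d)."""
        print(f"processing {d}")

    if __name__ == "__main__":
        fire.Fire(create_dataset)

Fire exposes the parameter name as a flag, so the command shape matches the acceptance criteria; whether the single-dash -d spelling works out of the box should be verified against Fire's flag parsing.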

parallel processing is using the same temp dir per dataset

When processing a megaset, files overwrite each other in the temp dir. This causes all kinds of crashes that prevent processing from continuing or, in the worst case, sends up inconsistent or bad data.

I have put a fix in the variance dataset branch.
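
For reference, the shape of the fix is to give each dataset its own temp dir rather than a shared one; a hedged sketch, not the actual fix on the branch:

    import tempfile

    def process_dataset(dataset_name):
        # a unique, auto-cleaned temp dir per dataset, so parallel runs cannot collide
        with tempfile.TemporaryDirectory(prefix=f"{dataset_name}-") as tmp_dir:
            ...  # write this dataset's intermediate files under tmp_dir only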

xAxis and yAxis could be made optional

Use Case

Currently xAxis and yAxis are required.
I was wondering if they could be made optional if the scripts that live here auto-select the first couple of features to populate them.

Acceptance Criteria

Evaluate whether this would make data entry easier or whether we should continue to impose the requirement to make data entry more deliberate / intentional.
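
A hedged sketch of the auto-select idea, assuming feature defs carry a "key" field:

    def default_axes(feature_defs):
        # fall back to the first two features when xAxis/yAxis are omitted
        keys = [f["key"] for f in feature_defs]
        return keys[0], keys[1] if len(keys) > 1 else keys[0]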

Initialize CLI and set up new python directory

Use Case

Please provide a use case to help us understand your request in context
Establish the foundational components: setting up the CLI and directory, reading the input file, providing initial prompts, etc.

Acceptance Criteria

Please describe how you know this is done
Should be able to run the command `create_dataset` to read the input file.
Users can receive both required and optional prompts.

Details

Please provide any helpful specifications

add download links to datasets

add an optional attribute to a dataset description doc, i.e. dataset.json (not a megaset), called downloadLinks, that is an array of objects. Each object has required attributes:
{
    date: string,
    title: string,
    link: string
}

Update to dataset processing:

On upload, this data should be included in the dataset description information. The schema setup should do this automatically.

For updating the schema:

  1. add a new file in array-items called "dataset-link.schema.json" that has the definition of one single dataset link, i.e.
        title: string,
        link: string
  2. reference that individual schema in definitions.schema.json, where
        datasetLinks:
            type: array
            items:
                $ref: dataset-link.schema.json
                type: object
  3. update input-dataset-info.schema.json and dataset-card.schema.json to have datasetLinks reference the object def in definitions.schema.json
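
A hedged sketch of what validating one link against the new schema might look like, using the jsonschema package; the inline schema below is an illustration, not the contents of the actual dataset-link.schema.json:

    from jsonschema import validate

    dataset_link_schema = {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "title": {"type": "string"},
            "link": {"type": "string"},
        },
        "required": ["date", "title", "link"],
    }

    validate(
        instance={
            "date": "May/26/2021",
            "title": "The hiPSC single-cell image dataset",
            "link": "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_cell_image_dataset/tree/latest/metadata.csv",
        },
        schema=dataset_link_schema,
    )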

Amendments to the datasets themselves:

Add this data to the existing datasets (@meganrm will update this with which link goes to which dataset):
[
    {
        date: "May/26/2021",
        title: "The hiPSC single-cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/21/2022",
        title: "The hiPSC single i1 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_i1_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/20/2022",
        title: "The hiPSC single M1 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_m1_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/21/2022",
        title: "The hiPSC single i2 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_i2_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/20/2022",
        title: "The hiPSC single M2 cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_m2_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Jul/19/2022",
        title: "The hiPSC single edge cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_edge_cell_image_dataset/tree/latest/metadata.csv",
    },
    {
        date: "Sep/28/2022",
        title: "The hiPSC single non-edge cell image dataset",
        link: "https://open.quiltdata.com/b/allencell/packages/aics/hipsc_single_nonedge_cell_image_dataset/tree/latest/metadata.csv",
    },
]

Re-publish docs with new schema:

npm run publish-docs

add transform data

We need to add per-image data containing a transform object.
Propose that the data be considered as a new top-level data item in the features file alongside file_info and features.

file_info: [...],
features: [...],
transform: {
    translate: [number, number, number] = [0, 0, 0],
    rotate: [number, number, number] = [0, 0, 0]
}

transform needs to be completely optional, allowed to be missing from any image.
The data needs to be queryable per-selected-cell, and it would be ok (preferred?) to include it in the file_info request since that is already fully implemented in CFE.
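
A hedged sketch of how a consumer could read the optional transform, applying the defaults shown above when it is absent:

    def get_transform(file_info_entry):
        # transform is completely optional and may be missing from any image
        t = file_info_entry.get("transform") or {}
        return {
            "translate": t.get("translate", [0, 0, 0]),
            "rotate": t.get("rotate", [0, 0, 0]),
        }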

Alternative:

include this translate and rotate data as actual features, make them always hidden from the plot, and add dataset info to indicate the feature key names that contain the transform data (`translatexfeature: "translate_x"`, etc.). Then CFE can use the dataset info feature keys to pull out the transform data, if it exists, to pass into the 3D viewer.

validate totalCells and totalFOVs

Use Case

A dataset must have a non-zero value for either totalCells or totalFOVs or both, in userData.

Acceptance Criteria

Add a validation function to check the values in userData/totalCells and userData/totalFOVs, and error if both values are zero.
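
A minimal sketch of the check, assuming userData is available as a dict:

    def validate_totals(user_data):
        total_cells = user_data.get("totalCells", 0)
        total_fovs = user_data.get("totalFOVs", 0)
        # at least one of the two must be non-zero
        if not total_cells and not total_fovs:
            raise ValueError("userData must have a non-zero totalCells or totalFOVs")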

idea: Python create_dataset script

Use Case

More tooling to help scientists set up data for CFE. The goal is to make it simple: take in a spreadsheet of features, output a basic dataset. It's OK if it's not completely fleshed out, but it should validate.

Acceptance Criteria

Python script that can be run from command line or invoked from other python code.
It should:

  • set up a dataset directory and create all necessary json files
  • accept a spreadsheet of features (spec TBD, but it probably requires an "id" column, and the rest can be feature names)
  • automatically identify string-valued features and turn them into discrete features with options (see the sketch below)
  • fill in feature defs (and all other json files) as best it can
  • do some basic validation; tell the user if the spreadsheet results in bad data
  • accept a csv file or pandas dataframe
  • leave image processing to other code; optionally fill in image names as placeholders
  • bonus: provide an "interactive" mode that prompts for some data entry on the command line (e.g. feature descriptions, units, some of the dataset.json fields, etc.)

Details

Must be Python so other scientists can import it
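
A hedged sketch of the discrete-feature detection mentioned above; the feature-def field names here are assumptions, not the repo's actual schema:

    import pandas as pd

    def infer_feature_defs(df, id_column="id"):
        defs = []
        for col in df.columns:
            if col == id_column:
                continue
            if df[col].dtype == object:
                # string-valued column -> discrete feature with one option per unique value
                options = {str(i): {"name": v} for i, v in enumerate(df[col].dropna().unique())}
                defs.append({"key": col, "discrete": True, "options": options})
            else:
                defs.append({"key": col, "discrete": False})
        return defs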

improve ux for data creators

Create a single validation script that a data creator can run to ensure integrity before attempting to upload a dataset:
npm run validate-dataset [PATH_TO_DATASET]

A second improvement for data creators would be to publish the docs online (maybe we already do this, but I didn't check; improving documentation in general is the gist here).

code reorg: separate data

Put all data subdirs into a data/ directory, and all source code into a src/ to make it very obvious which parts of this repo are code vs data.

Create a folder within data

Use Case

Please provide a use case to help us understand your request in context
Create a folder within data and name it [name-version] with the file structure:

|-- /[dataset_name_version]/
|   |-- dataset.json
|   |-- feature_defs.json
|   |-- cell_feature_analysis.json
|   |-- image_settings.json
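
A minimal sketch of scaffolding that structure; the files are created empty as placeholders:

    from pathlib import Path

    def scaffold_dataset(name_version, root="data"):
        dataset_dir = Path(root) / name_version
        dataset_dir.mkdir(parents=True, exist_ok=True)
        for filename in ("dataset.json", "feature_defs.json",
                         "cell_feature_analysis.json", "image_settings.json"):
            (dataset_dir / filename).touch()  # placeholder to fill in later
        return dataset_dir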

Acceptance Criteria

Please describe how you know this is done
The dataset folder should exist within /data and contain the specified files, ready for segregation and storage.

Details

Please provide any helpful specifications

audit the "required" settings in the schema for things that are no longer required

What needs to happen?

A few things in the schema are marked as required that aren't; they should be changed:

dataset.json: image albumPath? (need to check the front end on this one); viewerSettingPath? seems like it should be optional since the settings themselves are optional; dataDownloadRoot

feature_defs.json: tooltip (need to check the front end); description could be optional

We should pair-program on this to check the front-end settings.

validate from command line for single datasets

Now that we can validate all datasets via GitHub Actions, we should add a script to validate a single dataset that can be run from the CLI. The old validate.js should then be removed or replaced by it.

"key" is required if "name" is not unique in options

Use Case

Please provide a use case to help us understand your request in context
Based on discrete-feature-option.schema.json, the key property is required when the name is not unique within options. Our current ajv validator passes datasets that lack a key in options even when the name is not unique.

For instance:

// we want validation to fail if `key` is missing in this case
"options": {
    "5": {
        "color": "#77207C",
        "name": "Matrix adhesions"
    },
    "6": {
        "color": "#77207C",
        "name": "Matrix adhesions"
    }
}

// we want these to pass
"options": {
    "5": {
        "color": "#77207C",
        "name": "Matrix adhesions",
        "key": "Paxillin"
    },
    "6": {
        "color": "#77207C",
        "name": "Matrix adhesions",
        "key": "Some Key"
    }
}


Question:

  • How do we determine whether a name is unique or not? Is there a predefined list of common/unique names? (solved)

Acceptance Criteria

Please describe how you know this is done

  1. check if the name is unique
  2. apply conditions to the subschemas in discrete-feature-option.schema: if the name is not unique, key should be required; if the name is unique, key is optional (see the sketch below)
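
A hedged sketch of the uniqueness check itself; the error wording and option shape are illustrative:

    from collections import Counter

    def validate_option_keys(options):
        name_counts = Counter(opt["name"] for opt in options.values())
        errors = []
        for idx, opt in options.items():
            # a duplicated name must carry a disambiguating key
            if name_counts[opt["name"]] > 1 and "key" not in opt:
                errors.append(f"option {idx!r}: duplicate name {opt['name']!r} requires a key")
        return errors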

dataset feature verification log for data creators

Use Case

Please provide a use case to help us understand your request in context

It would be good if data creators were able to verify the details of the datasets they create, i.e. verify that the feature names in featuresDataOrder match the order of their values in the cell-feature-analysis file.

Acceptance Criteria

Please describe how you know this is done

  • output a map of ordered feature names and values in the terminal when running a single dataset validation (see the sketch below)
  • prompt the user to check the map and update the dataset if there are discrepancies
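
A minimal sketch of that output, assuming featuresDataOrder and one sample row of values are already loaded:

    def print_feature_map(features_data_order, sample_row):
        # pair each feature name with the value at the same position
        for name, value in zip(features_data_order, sample_row):
            print(f"{name}: {value}")
        print("Check that each value matches its feature name; update the dataset if anything is off.")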

Add firebase secret tokens to github and do a little code cleanup

I'm not sure how Firebase credentials are doled out in general, but any user who needs to be able to publish datasets will be blocked without proper authentication.

Perhaps if we had GitHub Actions that could publish data, we could store the creds in a single place here in this repo.

display order for megaset cards

The datasets in a megaset are specified as an ordered list in the input json, but they get turned into a keyed object whose key order is indeterminate (and at runtime this results in random ordering of the cards in CFE).

Marking this as a bug because it's unexpected from the point of view of the data entry person.
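
One possible fix, sketched under the assumption that each dataset carries a usable name field, is to stamp the input order onto the keyed object so CFE can sort the cards deterministically:

    def key_datasets(dataset_list):
        # preserve the input list order with an explicit index
        return {d["name"]: {**d, "order": i} for i, d in enumerate(dataset_list)}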
