jump-cellpainting / datasets



Images and other data from the JUMP Cell Painting Consortium

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 100.00%


datasets's Issues

Question regarding the illumination correction

Dear JUMP team,

I have a brief question regarding the illumination correction of cpg0016 for source 4.
Did you use the same methodology (i.e. the same CellProfiler pipeline) as described in Rohban et al., 2017 and Bray et al., 2016 to compute the illumination correction function that is provided as a .npy file in the image directories?
Additionally, I was wondering whether correcting for uneven illumination was the only image correction applied before nuclei and cells were segmented and their morphological profiles computed for cpg0016 source 4.
Also, to obtain the illumination-corrected images, the raw images were simply divided by the illumination correction function, right?
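
A minimal sketch of the division this question describes, assuming the illumination function is a 2-D NumPy array with the same shape as the raw image. Whether the production pipeline does exactly this is what the question asks; the function name and the synthetic arrays below are illustrative only.

```python
import numpy as np


def apply_illumination_correction(raw, illum):
    """Divide a raw image by its illumination function, pixel by pixel."""
    raw = np.asarray(raw, dtype=np.float64)
    illum = np.asarray(illum, dtype=np.float64)
    if raw.shape != illum.shape:
        raise ValueError(f"shape mismatch: {raw.shape} vs {illum.shape}")
    return raw / illum


# Synthetic example: a flat field of intensity 100 seen through
# an uneven illumination profile that ranges from 1.0 to 2.0.
flat = np.full((4, 4), 100.0)
illum = np.linspace(1.0, 2.0, 16).reshape(4, 4)
raw = flat * illum

corrected = apply_illumination_correction(raw, illum)
```

With real data, `illum` would presumably be loaded with `np.load` from one of the `*_Illum<Channel>.npy` files and `raw` from the matching channel image.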

Thanks a million in advance for the clarifications and also for working on this outstanding data set generation and curation effort!

Some wells in load_data are missing (but are present in `well.csv.gz`)

Hi there,

I noticed that the metadata for some wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, plate Dest210803-160702 may be missing. Could you help me double-check? Feel free to correct me if I have misunderstood something.

A quick demo to reproduce it:

import os

import pandas as pd  # reading s3:// paths also requires the s3fs package

# Assumption: DATA_ROOT points to a local clone of jump-cellpainting/datasets.
DATA_ROOT = "."

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

nan_index = {
    "Metadata_Source": "source_10",
    "Metadata_Batch": "2021_08_12_U2OS_48_hr_run15",
    "Metadata_Plate": "Dest210803-160702",
}

# Wells listed in the plate's load_data file on S3
s3_path = S3_LOADDATA_FORMATTER.format(**nan_index)
s3_nan_meta = pd.read_parquet(s3_path, storage_options={"anon": True})
wells_from_parquet = s3_nan_meta["Metadata_Well"].unique()  # <----- Here wells are enumerated from A01 to C22

# Wells listed for the same plate in the repository metadata
wells = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "well.csv.gz"))
same_source = wells["Metadata_Source"] == nan_index["Metadata_Source"]
same_plate = wells["Metadata_Plate"] == nan_index["Metadata_Plate"]
wells_plate_info = wells.loc[same_source & same_plate, :]
wells_from_plate = wells_plate_info["Metadata_Well"].unique()  # <----- Here wells are enumerated from A01 to P24

It seems that well.csv.gz lists more wells for this plate than load_data_with_illum.parquet does. Is this a corner case that I missed, or is the upload still in progress?
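
The gap between the two well lists can be made explicit with a set difference. This is a small sketch on stand-in data: `plate_wells` and `missing_wells` are hypothetical helpers, and treating "A01 to C22" as a contiguous run of wells is an assumption.

```python
from itertools import product
from string import ascii_uppercase


def plate_wells(n_rows=16, n_cols=24):
    """Well names of a full plate, A01..P24 for the default 384-well layout."""
    rows = ascii_uppercase[:n_rows]
    return [f"{r}{c:02d}" for r, c in product(rows, range(1, n_cols + 1))]


def missing_wells(reference, observed):
    """Wells present in `reference` but absent from `observed`, sorted."""
    return sorted(set(reference) - set(observed))


# Stand-ins for the two arrays computed in the demo above.
wells_from_plate = plate_wells()
wells_from_parquet = [w for w in wells_from_plate if w <= "C22"]

missing = missing_wells(wells_from_plate, wells_from_parquet)
```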

Thanks for your time and effort.

Best wishes,
Nino

Create FAQ for the resource

I'll start this thread for collating FAQs. To get started, let's link out to issues that seem like good additions to the FAQ-to-be.

cpg0012 - error in load_data.csv

Hi there,

Thank you for setting up these amazing datasets! We're having a great time going through them. After some double-checking, I believe there is an error in your load_data.csv files: the AGP and Mito channel filenames are identical. I believe I figured it out: the file names with '_w5' in them are the Mito channel, and I can use plate/well/site to correctly assign filenames in the load_data.csv files. If I am correct in that assumption, I have a sqlite db with the correct filenames for the Mito channel, if that would be helpful (however, only for a subset of the data, ~80 compounds).

Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?

Thank you again for your time and efforts!

Cheers, Reese

Why is the workspace/metadata folder missing in cpg0016?

Hi!

Our team at ViQi Inc. has been doing ML analysis on the cpg0012 dataset with a lot of success. We would now like to move on to the JUMP dataset cpg0016 and possibly cpg0004. However, I cannot find the metadata folders inside 'workspace' for any source in either of these datasets. Is the compound and dose information for these datasets stored elsewhere?

Thank you for your time and efforts!

Cheers, Reese

Missing labels for some wells in COMPOUND_EMPTY plates in source_1

I join the metadata from load_data_with_illum.parquet files with the data in well.csv.gz to download images for a plate and also get the associated perturbations. I noticed that there are no labels in well.csv.gz for some of the COMPOUND_EMPTY wells in load_data_with_illum.parquet for plate UL001661 in source_1. Note: jc.MetadataFiles.{get_well,get_plate} are convenience functions to read the metadata files at commit 4b24577. This is not the most recent commit on main and I will double check with the most recent commit on main, too.

In [72]: import pandas as pd

In [73]: import jump_conversion as jc

In [74]: load_data = pd.read_parquet(Path.home() / 'data/jump.zarr/.cache/cpg0016-jump/source_1/workspace/load_data_csv/Batch1_20221004/UL001661/load_data_with_illum.parquet').assign(Metadata_Plate='UL001661')

In [75]: well = jc.MetadataFiles.get_well()

In [76]: plate = jc.MetadataFiles.get_plate()

In [77]: with_jcp = load_data.merge(well, how='left', on=['Metadata_Plate', 'Metadata_Well'])

In [78]: with_jcp[with_jcp.Metadata_JCP2022.isnull()]
Out[78]:
     Metadata_Source_x   Metadata_Batch Metadata_Plate Metadata_Well Metadata_Site      FileName_IllumAGP  ...                                   PathName_OrigDNA                                    PathName_OrigER                                  PathName_OrigMito                                   PathName_OrigRNA Metadata_Source_y Metadata_JCP2022
184           source_1  Batch1_20221004       UL001661           B02             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
185           source_1  Batch1_20221004       UL001661           B02             2  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
186           source_1  Batch1_20221004       UL001661           B02             3  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
187           source_1  Batch1_20221004       UL001661           B02             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
192           source_1  Batch1_20221004       UL001661           B04             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
...                ...              ...            ...           ...           ...                    ...  ...                                                ...                                                ...                                                ...                                                ...               ...              ...
3815          source_1  Batch1_20221004       UL001661           U35             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4760          source_1  Batch1_20221004       UL001661           Z42             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4761          source_1  Batch1_20221004       UL001661           Z42             2  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4762          source_1  Batch1_20221004       UL001661           Z42             3  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4763          source_1  Batch1_20221004       UL001661           Z42             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
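
The same check can be written with pandas' `indicator` option, which marks rows that found no match in the well metadata. A minimal sketch on synthetic frames; the plate name `P1` and the JCP2022 value are placeholders, not real identifiers.

```python
import pandas as pd

# Synthetic stand-in for one plate's load_data_with_illum.parquet
load_data = pd.DataFrame(
    {
        "Metadata_Plate": ["P1"] * 4,
        "Metadata_Well": ["A01", "A01", "B02", "B04"],
        "Metadata_Site": [1, 2, 1, 1],
    }
)

# Synthetic stand-in for well.csv.gz (only A01 has a label)
well = pd.DataFrame(
    {
        "Metadata_Plate": ["P1"],
        "Metadata_Well": ["A01"],
        "Metadata_JCP2022": ["JCP2022_placeholder"],
    }
)

merged = load_data.merge(
    well, how="left", on=["Metadata_Plate", "Metadata_Well"], indicator=True
)

# Rows marked "left_only" exist in load_data but have no entry in well
unlabeled = (
    merged.loc[merged["_merge"] == "left_only", ["Metadata_Plate", "Metadata_Well"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```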

At first, I thought these might be blank images, as described in #61 (comment), but plate UL001661 is not listed in that comment. I downloaded one DNA channel image for one of the wells that I identified, from

s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch1_20221004/images/UL001661__2022-10-05T05_07_32-Measurement1/Images/r02c02f01p01-ch4sk1fk1fl1.tiff

I found that the image is not blank, but it is very noisy, with strong artifacts and a visible well edge.

Did these wells fail QA, so that they should be excluded and are therefore not included in the metadata? Can I extrapolate that to any other well that is not available in well.csv.gz?

Thank you!

Using the data

Hi,
Thanks for this awesome work!
I'm starting to look at the data and I have a couple of questions. As I understand it, cpg0016 is not fully released yet, but the other datasets are. To get a better understanding of the data and its structure, I am planning to use cpg0012 from Bray et al. Could you please give me some pointers for getting started with it? I would like to know how to download the data (or stream it, if possible), how to apply the illumination function and the QC process, and whether there is a metadata file (like the ones in BBBC022, for example) with compound names, plate IDs, image paths 1 to 5, etc.

Thank you very much for your help !

how to define label in ORF dataset (cpg0016)

Thank you for your great work on this dataset! We would like to pre-train an image classification model on cpg0016, but I am confused about how to define the label for the classification task correctly. Is the column "Metadata_JCP2022" the label? In the COMPOUND and CRISPR parts of the dataset, "Metadata_JCP2022" and "Metadata_InChIKey"/"Metadata_NCBI_Gene_ID" map one-to-one, so "Metadata_JCP2022" could serve as the label. In the ORF part, however, the mapping is one-to-many: a single Metadata_NCBI_Gene_ID can have more than one Metadata_JCP2022 ID. How should the label be defined for a classification task on the ORF dataset: Metadata_NCBI_Gene_ID or Metadata_JCP2022? Thanks a lot!
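
A short sketch of how the one-to-many mapping described here can be checked, assuming the ORF metadata is loaded into a DataFrame with the two columns named in the question. The values below are synthetic placeholders.

```python
import pandas as pd

# Synthetic stand-in for the ORF metadata table
orf = pd.DataFrame(
    {
        "Metadata_JCP2022": ["JCP_A", "JCP_B", "JCP_C"],
        "Metadata_NCBI_Gene_ID": ["111", "111", "222"],
    }
)

# Number of distinct JCP2022 IDs attached to each gene
ids_per_gene = orf.groupby("Metadata_NCBI_Gene_ID")["Metadata_JCP2022"].nunique()

# Genes that map to more than one JCP2022 ID
multi = ids_per_gene[ids_per_gene > 1]
```

If `multi` is empty, the two columns are interchangeable as labels; otherwise the choice decides whether different constructs of the same gene share a class.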

Reinstate Brightfield channels in LoadData CSVs

Thanks for providing the fluorescent images in well-structured csv/parquet files. How can I find the brightfield images in the same way? i.e., for a given plate and well, I'd want to find the brightfield images and their associated 5 fluoro images.

Add citation file, and maybe Zenodo json file

A citation .cff would be good to have for citing this repository.

If we want to update the "type" in Zenodo to dataset rather than software, we probably need to put a .zenodo.json at the root of our repository. We can also add contributors' affiliations/names/ORCIDs to this spec if we wish. This is a draft based on the metadata Zenodo provided for the first upload and on its documentation for GitHub integrations. We can add papers as "related identifiers" as we like.

{
    "license": "other-open", 
    "title": "jump-cellpainting/datasets: JUMP-CP Datasets", 
    "upload_type": "dataset", 
    "access_right": "open", 
    "related_identifiers": [
        {
            "scheme": "doi", 
            "identifier": "10.5281/zenodo.7628768", 
            "relation": "isVersionOf"
        }
    ]
}

Error in illumination correction filenames for source_11

There is an error in the illumination correction file metadata for one of your plates:

s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/illum/EC000038/EC000001_IllumAGP.npy
s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/illum/EC000038/EC000001_IllumDNA.npy
s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/illum/EC000038/EC000001_IllumER.npy
s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/illum/EC000038/EC000001_IllumMito.npy
s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/illum/EC000038/EC000001_IllumRNA.npy

These should be /EC000038/EC000038_Illum*.npy
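
A small sketch for spotting this kind of mismatch, assuming the layout seen in the paths above, where the file-name prefix before `_Illum` should equal the plate directory name. `illum_name_mismatch` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def illum_name_mismatch(key):
    """Return True when the file-name prefix differs from the plate directory."""
    path = PurePosixPath(key)
    plate_dir = path.parent.name
    file_plate = path.name.split("_Illum")[0]
    return plate_dir != file_plate


reported = (
    "s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/"
    "illum/EC000038/EC000001_IllumAGP.npy"
)
expected = (
    "s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/"
    "illum/EC000038/EC000038_IllumAGP.npy"
)

flags = [illum_name_mismatch(reported), illum_name_mismatch(expected)]
```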

(h/t A.L.)

how to get bucket name and endpoints?

Thank you for sharing this great dataset! We want to make a copy of this dataset on another cloud, but we get the error "endpoint and bucket not match". We use the bucket arn:aws:s3:::cellpainting-gallery and the endpoint https://dataexchange.us-east-1.amazonaws.com. Could you point us to the correct endpoint and bucket name? Thank you very much!

2020_11_04_CPJUMP1 (cpg0000-jump-pilot) dataset

Hello, I would like to download the dataset cpg0000-jump-pilot to work on the 300+ compounds. Where can I find the names of the compounds (is there a CSV file for that), and is there any information about the cell death modality induced by those 300+ compounds (such as necrosis or apoptosis)? Many thanks in advance!

Release SMILES strings

Hello @shntnu! I noticed that for a lot of the compounds in metadata/compound.csv, there are no corresponding SMILES strings that can be found on PubChem. Is it possible to provide another column for SMILES? Thank you!

Not able to find cp001 in cp-gallery

Hello,
I am trying to download the four JUMP datasets. Looking at https://cellpainting-gallery.s3.amazonaws.com/index.html, I see a number of folders with "jump" in their names, but I am not sure whether they relate to the four datasets mentioned in the README.
My assumption is that there should be four directories with names cpg0000-jump, cpg0002-jump, cpg0002-jump and cpg0016 as the principal dataset. However, I see cpg0000-jump-pilot/, cpg0002-jump-scope/, cpg0014-jump-adipocyte/, cpg0016-jump-fixed/, cpg0016-jump/, dev-cpg0016-jump/, test-cpg0016-jump/ and jump/ ("jump/" is not accessible).
Can you please clarify which folders correspond to the four JUMP datasets, and how I can download only the images?

Thank you!

Positive controls identifiers

Hi all,

I spent some time locating in cpg0016 the positive controls documented here, and I think I found a few issues that might be worth flagging.

First of all, out of the 8 InChIKeys documented in Target-2 as positive controls, only 3 seem to be present in compound.csv.gz: AMG900, LY2109761 and TC-S-7004. The main reason for the mismatch seems to be that the compound InChIs and InChIKeys provided in compound.csv.gz do not contain stereochemical information, so matching on only the first block of the InChIKey solves the problem for 4 more compounds: NVS-PAK1-1, FK-866, quinidine and aloxistatin.
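
The first-block matching described above can be sketched as follows. The first 14 characters of an InChIKey hash the molecular skeleton, so two keys that differ only in stereochemistry share that block. The helper names are hypothetical, and the stereo-free key is constructed here purely for illustration (the block `UHFFFAOYSA` is what a standard InChIKey carries when no stereo or isotope layers are present).

```python
def first_block(inchikey):
    """Return the 14-character connectivity block of an InChIKey."""
    return inchikey.split("-")[0]


def same_skeleton(key_a, key_b):
    """True when two InChIKeys agree on their first (connectivity) block."""
    return first_block(key_a) == first_block(key_b)


# Keys quoted in this issue
dexamethasone = "UREBDLICKHMUKA-CXSFZGCWSA-N"
found_in_metadata = "GJFCONYVAUNLKB-UHFFFAOYSA-N"

# Illustrative only: the same skeleton with the stereo information dropped
dexamethasone_flat = first_block(dexamethasone) + "-UHFFFAOYSA-N"

matches_flat = same_skeleton(dexamethasone, dexamethasone_flat)
matches_found = same_skeleton(dexamethasone, found_in_metadata)
```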

However, dexamethasone can't be resolved with simple matching. So I looked into the metadata for TARGET-2 plates at wells H24 and K02 (where dexamethasone is supposed to be, as per Target-2 platemap and metadata), and found the following InChIKey in most plates: GJFCONYVAUNLKB-UHFFFAOYSA-N. Only some plates from sources 7 and 9 seem to contain other compounds.

So, here are a few questions:

Would it be worth harmonizing the Target-2 documentation to clarify the InChIKeys that are actually used in the main JUMP dataset?
Could someone clarify the mismatch on dexamethasone? GJFCONYVAUNLKB-UHFFFAOYSA-N points to a compound (pubchem link) that is not really dexamethasone (UREBDLICKHMUKA-CXSFZGCWSA-N, Tanimoto coefficient ~=0.9), so I am wondering if this is an error in the metadata or if this was intentionally a different compound.
It seems like some TARGET-2 plates from sources 7 and 9 don't follow the expected Target-2 layout. Is this a known issue?

Many thanks.

Data source in other cell lines

It seems that all data listed in the repo are from U2OS or A549 cell lines. Are there any data with compound perturbation on other cell lines?

Provide information about missing files

I am trying to download all images for source_11 that I can find in the respective load_data_with_illum.parquet files. I found that for these parquet files,

['cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000038/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000066/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000070/load_data_with_illum.parquet']

there are 1216 fields/sites with at least one missing image, for a total of 6068 missing images. I attached the list as [source_11-404.csv](https://github.com/jump-cellpainting/datasets/files/12325106/source_11-404.csv); locally the file is named source_11-404.txt, and I had to change the extension from txt to csv to attach it to this comment. This is what the file looks like:

$ head source_11-404.txt
failed-paths
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch5sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch1sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch5sk1fk1fl1.tiff

For example, aws s3 ls on the first file in the snippet above exits with code 1, i.e. the key does not exist:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff

$ echo $?
1

When I use the same key but change the channel from ch2 to ch1, that file exists:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch1sk1fk1fl1.tiff
2022-12-21 21:35:43    2058750 r11c10f03p01-ch1sk1fk1fl1.tiff

$ echo $?
0

I will double-check that I inferred the correct file names from the parquet files. The existence of ch1 in this example suggests that I inferred the correct names, at least for that field/site.

To find the number of missing fields/sites, I removed the channel sub-string:

$ cat notebooks/data/source_11-404.txt | sed 's/-ch[0-9]sk1fk1fl1.tiff//' | sort | uniq -c | wc -l
1217

Subtract 1 for the CSV header.
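
The same count can be sketched in Python, assuming the file-name pattern seen in the listing above (a channel suffix of the form `-ch<N>sk1fk1fl1.tiff`). Working on a list of paths without the header line avoids the subtract-one step. `count_missing_sites` is a hypothetical helper, and the paths are the nine shown in the excerpt.

```python
import re

# Channel suffix as it appears in the listing above
CHANNEL_SUFFIX = re.compile(r"-ch\d+sk1fk1fl1\.tiff$")


def count_missing_sites(paths):
    """Number of distinct fields/sites among the missing image paths."""
    return len({CHANNEL_SUFFIX.sub("", p) for p in paths})


prefix = (
    "cpg0016-jump/source_11/images/Batch2/images/"
    "EC000038__2021-06-04T17_37_00-Measurement1/Images/"
)
names = [
    "r11c10f03p01-ch2sk1fk1fl1.tiff",
    "r11c10f03p01-ch4sk1fk1fl1.tiff",
    "r11c10f03p01-ch3sk1fk1fl1.tiff",
    "r11c10f03p01-ch5sk1fk1fl1.tiff",
    "r11c10f04p01-ch1sk1fk1fl1.tiff",
    "r11c10f04p01-ch2sk1fk1fl1.tiff",
    "r11c10f04p01-ch4sk1fk1fl1.tiff",
    "r11c10f04p01-ch3sk1fk1fl1.tiff",
    "r11c10f04p01-ch5sk1fk1fl1.tiff",
]
failed_paths = [prefix + name for name in names]

n_sites = count_missing_sites(failed_paths)
```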

Upload arm-centering-corrected CRISPR profiles

Let's deposit the profiles here:

cellpainting-gallery/
└── cpg0016-jump
    └── source_13
        └── workspace_dl
            └── consensus

Questions about cpg0016 metadata

Thank you for sharing this great dataset! It’s very useful for our work! In the process of using this dataset, I have a few questions:

Where can we find a detailed explanation of each category in plate.csv, especially the Metadata_PlateType column?
In the ORF dataset, how should we interpret the two values, trt and control, in the Metadata_pert_type column?
Is there any healthy control group or disease control group in compound.csv / orf.csv?
Can I consider the cells treated with an ORF to be cells in a disease state? And which part of the data corresponds to cells in a healthy state without any treatment?
Will the compound dose information be added to compound.csv?

I am looking forward to your answer! Thank you very much for your help!

What are the rows with missing gene names in the ORF dataset?

Another low-pri question for you @shntnu (-- and please let me know if I should be directing these elsewhere/to somebody else):

We see a bunch of rows that have no gene information attached to them. They do have an ORF "name" and a sample ID, and I can't seem to figure out what they mean:

Here are some examples:

In [61]: merged[merged["Metadata_Taxon_ID"].isnull()][
    ...:     ["Metadata_JCP2022", "Metadata_broad_sample", "Metadata_Name", "Metadata_Symbol"]
    ...: ].drop_duplicates().head()
Out[61]: 
      Metadata_JCP2022   Metadata_broad_sample         Metadata_Name  \
2401      JCP2022_900001         BAD CONSTRUCT         BAD CONSTRUCT   
24263     JCP2022_913554    ccsbBroad304_14521  ORF004451.1_TRC304.1   
24653     JCP2022_912829    ccsbBroad304_13762  ORF003953.1_TRC304.1   
26714     JCP2022_912852    ccsbBroad304_13786  ORF003883.1_TRC304.1   
30872     JCP2022_912857    ccsbBroad304_13791  ORF000206.1_TRC304.1   

      Metadata_Symbol  
2401              NaN  
24263             NaN  
24653             NaN  
26714             NaN  
30872             NaN

(You've already mentioned the ORFs that are flagged as bad constructs, but there are many more that are missing gene information generally)

what's the meaning of "source"?

What is the meaning of "source"? Does it refer to different labs/partners?

How does one identify replicates?

Hi, am I correct in thinking that if two datapoints in the well.csv.gz share the same 'Metadata_JCP2022' identifier then they are replicates of one another? I guess this excludes 'JCP2022_999999' which I think represents non-Compound perturbations.

If this is the case, most compound IDs seem to have 3, 4 or 5 replicates within the well-level data. Some, however, have orders of magnitude more than that: for example, 'JCP2022_037716' appears to have 9,099 associated datapoints in the latest 'well.csv.gz' file.

Is this the correct way to think about the identifiers, and if so why do some have so many replicates within the dataset?
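
A minimal sketch of the replicate count described above, assuming replicates are simply rows of the well table that share a 'Metadata_JCP2022' value. The identifiers are synthetic, and the five-times-median cutoff is an arbitrary illustration rather than a recommended threshold.

```python
import pandas as pd

# Synthetic stand-in for well.csv.gz
well = pd.DataFrame(
    {"Metadata_JCP2022": ["JCP_A"] * 3 + ["JCP_B"] * 5 + ["JCP_C"] * 40}
)

# Number of wells per identifier, most frequent first
counts = well["Metadata_JCP2022"].value_counts()

# Identifiers that occur far more often than the typical one
frequent = counts[counts > 5 * counts.median()]
```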

Thank you in advance for your help,
Will

Some images seem to be missing in the first batches of source 4 (cpg0016).

Hello, some images might be missing in the first batches.

Example:

aws s3 cp --no-sign-request s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_06_07_Batch5/images/BR00123962__2021-06-17T04_57_17-Measurement1/Images/r16c04f01p01-ch4sk1fk1fl1.tiff r16c04f01p01-ch4sk1fk1fl1.tiff
fatal error: An error occurred (404) when calling the HeadObject operation: Key "cpg0016jump/source_4/images/2021_06_07_Batch5/images/BR00123962__2021-06-17T04_57_17-Measurement1/Images/r16c04f01p01-ch4sk1fk1fl1.tiff" does not exist

This file is listed, though, in load_data_with_illum.parquet for this plate.

Could you please check that images are in place? Especially for the first five batches.
Unfortunately I don't know how to check all those examples quickly source-wide.

Exact commit hashes for CellProfiler pipelines used in JUMP production

Is there documentation for the exact version and commit hash of the CellProfiler pipelines that were used by each source for analysis? https://github.com/broadinstitute/imaging-platform-pipelines/tree/master/JUMP_production#production-pipelines is linked in the paper, but there are two pipelines (JUMP_analysis_v2.cppipe and JUMP_analysis_v3.cppipe). Did all sources use the same pipeline (v3?) at the same commit hash?

Thank you

Will published datasets be somehow updated? Thanks!

Hi there,

Thanks for all the data you've shared and the maintenance effort behind it :)

Just a quick check: will the published datasets be updated on a rolling basis? For example, will new files be appended to or replace the existing ones under the published folders (e.g. sources)?

Thanks and wish your team all the best.
Nino

Unable to replicate CellProfiler pipeline output

I am attempting to use JUMP_production/JUMP_analysis_v3.cppipe to replicate the output for site: source_2, batch: 20210614_Batch_1 and plate: 1053600674 found in cpg0016-jump/source_2/workspace/analysis/20210614_Batch_1/1053600674/analysis/1053600674/.

However, when I run the pipeline, the IdentifyPrimaryObjects module finds only a couple of hundred objects across the whole plate.

My setup is as follows:

Download the plate images;
Download the plate illumination corrections; and
Download the load_data_with_illum CSV.

I modified the load data CSV to contain brightfield images by propagating the following pattern:

FileName_OrigBrightfield: 1053600674_A01_T0001F001L01A06Z01C06.tif;
FileName_OrigBrightfield_L: 1053600674_A01_T0001F001L01A06Z02C06.tif; and
FileName_OrigBrightfield_H: 1053600674_A01_T0001F001L01A06Z03C06.tif

where I updated the well coordinate to be appropriate for each row. I don't know if these are correct; however, I don't believe they are causing the issue I'm seeing. This relates to datasets/issues/79.

I modified the load data CSV to use my local paths rather than S3. This speeds up local tests; however, I have run against S3 with the same results: basically no objects found.

I believe this is related to the illumination correction files. I have run JUMP_production/JUMP_QC_LoadData_v1.cppipe, and it finds about 1/3 of the objects compared to the published results. As this pipeline generates its own illumination correction files, I did a quick test with the V3 pipeline using OrigDNA rather than CorrBlue for the IdentifyPrimaryObjects module; the V3 pipeline then finds a similar number of objects to the QC pipeline.

I have tried running JUMP_production/JUMP_illum_LoadData_v1.cppipe to generate the illumination correction files locally; however, the produced files are the wrong shape. I get the error:

Error while processing CorrectIlluminationApply:
This module requires that the image and illumination function have equal dimensions.
The OrigDNA image and IllumDNA illumination function do not ((996, 996) vs (995, 995)).
If they are paired correctly you may want to use the Resize or Crop module to make them the same size.

I ran that pipeline with my modified load data file (without illum columns).

Our intention is to use the pipeline for our own data and the first step was to replicate the JUMP results to gain confidence with the pipeline and CellProfiler.

Do you have any advice on how to proceed?

What does "ORF" mean?

Thanks for your work and for sharing the dataset! It's a very useful resource. However, I am confused by the genetic perturbation labelled as ORF (trt or control). It means it's not CRISPR, right? But is it overexpression, shRNA knockdown or knockout? I also wonder whether it is possible to treat the "trt" of this "ORF" as one of the perturbagens in ConnectivityMap.
The JUMP-Target-ORF repository mentions that it's overexpression, but that repository has only 175 genes. I guess it is not the same as this dataset?

What is the timeline for uploading the remaining data?

The JUMP dataset is still being updated. When will the complete JUMP dataset be available?

Only .parquet files in profiles directory

We noticed that the profiles folders do not follow the expected workspace folder structure (https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure), i.e.:

└── profiles
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv.gz
        │   ├── BR00117035_augmented.csv.gz
        │   ├── BR00117035_normalized.csv.gz
        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
        │   └── BR00117035_normalized_negcon.csv.gz
        └── BR00117036

Instead, they are actually directories of single parquet files (similar to the ones expected in workspace_dl https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure-1). Is this expected, or does folder_structure.md need updating?

Many thanks for any help!

Will metadata associated with the perturbations be released?

Hello,

How can a compound ID like JCP2022_000051 be translated into more useful information, such as the compound target or MoA? As far as I am aware, the InChIKey or InChI could be used for such a translation, but what tools could be used for this?

Thank you,
Martin

Verify whether a plate in cpg0016/source_7 is rotated

From #77 (comment)

As for source 7, it seems that the differences are limited to plate CP1-SC2-25, which is a 384-well plate. I looked more closely into the plate and I think that the layout in this case is mirrored (e.g. A01 is actually in P23, H12 in I13, etc.), which is why I couldn't get the expected compounds using the original plate map. I am not sure if this should be considered a problem or not.
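
To make the suspected layout concrete, here is a sketch of a plain 180° rotation of a 384-well plate (16 rows A to P, 24 columns). This is one possible reading of the comment above, not a confirmed description of the plate: it reproduces H12 to I13, but it maps A01 to P24 rather than the P23 quoted, so the actual transformation may differ. `rotate_180` is a hypothetical helper.

```python
from string import ascii_uppercase


def rotate_180(well, n_rows=16, n_cols=24):
    """Map a well name to its position after rotating the plate by 180 degrees."""
    row = ascii_uppercase.index(well[0])  # 0-based row index, A = 0
    col = int(well[1:])                   # 1-based column number
    new_row = ascii_uppercase[n_rows - 1 - row]
    new_col = n_cols + 1 - col
    return f"{new_row}{new_col:02d}"


rotated = {well: rotate_180(well) for well in ["A01", "H12", "P24"]}
```

Applying the function twice returns the original well, which is a quick sanity check for any remapping of this kind.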

How does one load pilot data sets?

Just wanted to ask how one would load the pilot data sets (cpg0000-jump-pilot, cpg0001-cellpainting-protocol, and cpg0002-jump-scope). How would that differ from the provided sample notebook? Is there a setup in which the images are handled by a PyTorch data loader? Thanks in advance!

How to obtain ORF features

May I ask how the data containing 1477 features, used for the UMAP plot of the ORF data in the article, was obtained? What is the relationship between the data containing 1477 features and the data containing 7648 features or 4762 features? Were these 1,477 features derived by processing the 7,648 features?
Thanks for your answers.

Question regarding the planned profile normalization for cpg0016

Dear JUMP-CP team,

thanks a lot for the great data resource.
I was wondering if you could briefly comment on the planned normalization of the profiles to (partially) mitigate batch effects for the ORF data. Since the treatment conditions are only partially repeated, I assume you plan to normalize the profiles with respect to the batch-specific (respectively plate-specific) negative controls as described here, right? Which negative control setting would you use for that - BFP, eGFP, HcRed, Luciferase or LacZ?

Thanks a lot in advance for your help!

Provide prefix sizes

I think it would be helpful to provide a breakdown of data sizes by source/numerical data/image data so people have an idea of what they're getting into before downloading without having to list the bucket themselves.

I'm not sure how much is still in flux, but our dashboard auto-calculated these prefix sizes, current as of right now. I'm happy to flesh out/update.

source	images size (TB)	workspace size (TB)	workspace_dl size (TB)	total size (TB)
1				13.2
2	7.6	10.8		21.6
3	16.6	20.6		42.5
4	17.6	17.3		39.1
5	13.1	32	7.4
6	11.7	25.8		43.7
7				14.9
8	7.2	12.1		24.4
9	9.2	17.8	7.1
10	7.5	11.3		21.6
11		10.3		21.6
13		15.8	6.8
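
For anyone who wants to recompute a breakdown like this from their own bucket listing, here is a minimal sketch. It assumes keys are laid out as `<dataset>/<source>/<folder>/...` and that sizes are reported in decimal terabytes; both are assumptions, and the function name is hypothetical. The (key, size) pairs would come from whatever listing tool you use.

```python
from collections import defaultdict

TB = 1e12  # decimal terabytes; an assumption about the units in the table


def sizes_by_prefix(objects, dataset="cpg0016-jump"):
    """Sum object sizes per (source, top-level folder) under one dataset prefix.

    objects: iterable of (key, size_in_bytes) pairs, e.g. parsed from a bucket listing
    """
    totals = defaultdict(float)
    for key, size in objects:
        parts = key.split("/")
        # Expected layout: <dataset>/<source>/<folder>/...
        if len(parts) < 4 or parts[0] != dataset:
            continue
        totals[(parts[1], parts[2])] += size / TB
    return dict(totals)
```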

Source of truth for all images with perturbation labels

I am currently preparing JUMP for our image processing pipeline. We are mostly interested in all images plus the perturbation label for each well. What is the source of truth for all wells in the dataset? I was able to find some sort of metadata file (Index.idx.xml, indexfile.txt, MeasurementData.mlf) in the images prefix for all plates except those from sources 7 and 8. I use that to create my own metadata table and join it with metadata/well.csv.gz to get the well treatment labels.

I have now found load_data_csv, which may actually be a better source of metadata. It covers all plates except the following (I have not checked sources 7 and 8 yet):

[('source_3', 'C13451bW'),
 ('source_3', 'C13451dW'),
 ('source_3', 'C13495dW'),
 ('source_3', 'J12440d'),
 ('source_3', 'SP16P19c'),
 ('source_3', 'SP24P27c'),
 ('source_3', 'SP24P27d')]

The sample_notebook.ipynb uses load_data_with_illum.parquet. I ran the same analysis for the parquet files and found that the same plates are missing for parquet.

Now I am thinking that I should use metadata/plate.csv.gz to identify all plates, then find the corresponding load_data_with_illum.parquet file for each plate, and download the data that way. Is this the preferred way to download/process the images?
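
The plan described above could be sketched roughly as follows. It assumes plate.csv.gz has the columns Metadata_Source, Metadata_Batch and Metadata_Plate, and that the load_data files follow the S3 path layout quoted elsewhere in these issues; the function name is hypothetical. Whether this is the preferred route is the question being asked, so treat it as an illustration only.

```python
import csv
import gzip
import io

# Path layout as quoted in another issue on this page.
S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)


def load_data_paths(plate_csv_gz: bytes):
    """Yield one load_data_with_illum.parquet URL per row of plate.csv.gz."""
    with gzip.open(io.BytesIO(plate_csv_gz), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            yield S3_LOADDATA_FORMATTER.format(
                Metadata_Source=row["Metadata_Source"],
                Metadata_Batch=row["Metadata_Batch"],
                Metadata_Plate=row["Metadata_Plate"],
            )
```

Plates whose URL does not exist (such as the source_3 plates listed above) would then show up as download failures rather than being skipped silently.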

Resolve inconsistencies in Target2 Compound InChIKeys

Hi all,

As a follow-up to #77, I have been trying to map the compound identifiers in the Target-2 plate map and metadata to the compound identifiers provided for Target-2 plates in the JUMP metadata files.
As a result, I found 36 (out of 384) wells for which the compound in the JUMP metadata doesn't match the Target-2 metadata:

Well	InChIKey Expected	InChIKey Found
A03	KRGQEOSDQHTZMX-IGCYCDGOSA-N	LPYXWGMUVRGUOY-UHFFFAOYSA-N
A06	ODHCTXKNWHHXJC-VKHMYHEASA-N	GUUGZPSUOTWOMD-UHFFFAOYSA-N
A12	NSFFHOGKXHRQEW-DVRIZHICSA-N	UTBOEBCWXGDOGI-UHFFFAOYSA-N
B01	LLPBUXODFQZPFH-UHFFFAOYSA-N	AJVXVYTVAAWZAP-UHFFFAOYSA-N
B05	CVOUSAVHMDXCKG-UHFFFAOYSA-N	ROBYKNONIPZMTK-UHFFFAOYSA-N
B24	QTQAWLPCGQOSGP-PHLMVCJGSA-N	HGMSUJCQIUFZBJ-UHFFFAOYSA-N
C13	CXJCGSPAPOTTSF-VURMDHGXSA-N	DXZRBHUCOHBAHP-UHFFFAOYSA-N
C24	HTIQEAQVCYTUBX-UHFFFAOYSA-N	YMDXSGBNCBQYGC-UHFFFAOYSA-N
D02	LXENKEWVEVKKGV-BQYQJAHWSA-N	VSVFLGPUZJTBSD-UHFFFAOYSA-N
D08	BMKPVDQDJQWBPD-UHFFFAOYSA-N	DUKQPWDVIZDABV-UHFFFAOYSA-N
E04	RZTAMFZIAATZDJ-UHFFFAOYSA-N	HAQDEJPEAKWAAM-UHFFFAOYSA-N
F18	KOCVKGYKBLJEPK-LYBHJNIJSA-N	WRLVHADVOGFZOZ-UHFFFAOYSA-N
F23	KVWDHTXUZHCGIO-UHFFFAOYSA-N	WXPNDRBBWZMPQG-UHFFFAOYSA-N
G05	WBGKWQHBNHJJPZ-LECWWXJVSA-N	ZMUSCGJNJYXJBP-UHFFFAOYSA-N
G06	POJZIZBONPAWIV-UHFFFAOYSA-N	GQXSULRYFDAMOO-UHFFFAOYSA-N
G15	VYMDGNCVAMGZFE-UHFFFAOYSA-N	PKYKNPLSFOKASK-UHFFFAOYSA-N
H24	UREBDLICKHMUKA-QCYOSJOCSA-N	GJFCONYVAUNLKB-UHFFFAOYSA-N
I06	KGPGQDLTDHGEGT-SZUNQUCBSA-N	LQERMDXPGNOJCT-UHFFFAOYSA-N
I18	VDJHFHXMUKFKET-WDUFCVPESA-N	HULPONUAINYLQQ-UHFFFAOYSA-N
J02	NSFFHOGKXHRQEW-AIHSUZKVSA-N	UTBOEBCWXGDOGI-UHFFFAOYSA-N
J07	XKFTZKGMDDZMJI-HSZRJFAPSA-N	KRBSMMVJJVHVCB-UHFFFAOYSA-N
J14	XKFTZKGMDDZMJI-HSZRJFAPSA-N	KRBSMMVJJVHVCB-UHFFFAOYSA-N
K02	UREBDLICKHMUKA-CXSFZGCWSA-N	GJFCONYVAUNLKB-UHFFFAOYSA-N
K05	GIUYCYHIANZCFB-FJFJXFQQSA-N	CAOWNCTTWGSKDO-UHFFFAOYSA-N
K13	SJFBTAPEPRWNKH-CCKFTAQKSA-N	XUZQTIZWMHMWOC-UHFFFAOYSA-N
L06	OHRURASPPZQGQM-GCCNXGTGSA-N	SOOPLNPQGWJZHY-UHFFFAOYSA-N
L11	NHFDRBXTEDBWCZ-ZROIWOOFSA-N	GMROZDPZEUVIGD-UHFFFAOYSA-N
N09	MBGGBVCUIVRRBF-UHFFFAOYSA-N	AUMHDRMJJNZTPB-UHFFFAOYSA-N
N14	HTSLEZOTMYUPLU-UHFFFAOYSA-N	AGNWVEJTZJIJIM-UHFFFAOYSA-N
O10	DEQANNDTNATYII-UHFFFAOYSA-N	JDKKNQACNITFEA-UHFFFAOYSA-N
O14	FAIIFDPAEUKBEP-UHFFFAOYSA-N	KJWGEXJCWCYEMI-UHFFFAOYSA-N
P01	AOZPVMOOEJAZGK-UHFFFAOYSA-N	UXUQIRNFBFRPAC-UHFFFAOYSA-N
P03	HFPLHASLIOXVGS-UHFFFAOYSA-N	CANBMWXJDLUDFF-UHFFFAOYSA-N
P12	UIAGMCDKSXEBJQ-UHFFFAOYSA-N	SVMHYHIZWOJKDL-UHFFFAOYSA-N
P18	ZDXUKAKRHYTAKV-UHFFFAOYSA-N	PHOGQKDIVUJGMJ-UHFFFAOYSA-N
P23	YYDUWLSETXNJJT-MTJSOVHGSA-N	LNFZRMDSZJCZTG-UHFFFAOYSA-N

As you can see, the first layer of the InChIKey is different, so the mismatch shouldn't be due to just missing stereochemical information.
Note that each row of the table applies to all of the TARGET2 plates described in the metadata files except for those coming from source_9 (I excluded these from my analysis code because they have a 1536 well layout and I wanted to keep things simple, see #77) and plate CP1-SC2-25 from source_7 (similarly, because it seems like the plate has a mirrored layout, see #77). So I ran this check on 131 plates, and for all of them I can find the differences described in the table above.

Any idea on whether the compounds used in the experiments are actually different, as suggested by the InChIKeys, or whether there is some issue in the metadata files provided in this repo?
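
The "first layer" comparison mentioned above can be sketched like this. A standard InChIKey has three hyphen-separated blocks, and the first 14-character block encodes the connectivity, so two keys that differ only in stereochemistry share it. The function names are hypothetical.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first (14-character) block of a standard InChIKey."""
    blocks = inchikey.strip().split("-")
    if len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError(f"unexpected InChIKey format: {inchikey!r}")
    return blocks[0]


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two InChIKeys share the connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


# Well A03 from the table above: the first blocks differ.
print(same_skeleton("KRGQEOSDQHTZMX-IGCYCDGOSA-N", "LPYXWGMUVRGUOY-UHFFFAOYSA-N"))  # False
# Expected keys for wells A12 and J02: same first block, different second block.
print(same_skeleton("NSFFHOGKXHRQEW-DVRIZHICSA-N", "NSFFHOGKXHRQEW-AIHSUZKVSA-N"))  # True
```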

Weird batch effects in source_1

Alex Lu said

So - I'm doing an analysis of the CP JUMP data, and I noticed that in the CellProfiler feature space, the plates for Batch6_20221102 (Source_1) have a really weird variation structure. I took a look at some images, and it looks like the contrast is just behaving really weirdly across images, to the point that some have completely different color histograms and others aren't even recognizable as cells (left is a perturbation, right is the closest control; coloring scheme is arbitrary because I just threw this together rapidly).

Do you know what's going on with these plates? Should I just toss them all out?

Shantanu said

I looked up our notes and it appears that there are some weird plates in batch 6, but then that was also true for batch 5. I'm not sure why all of batch 6 is weird and why some of the bad plates were included; to be investigated.

I'd recommend just tossing those plates out for now.

Thank you for reporting this. Is it okay if we post this observation on GitHub so we can point others to it?

Alex said

Please do! For posterity, here's the analysis I caught this with - basically, I was looking at the cosine distance between each compound perturbation and its spatially nearest control in the CellProfiler feature space. The y-axis is the wells in order of their occurrence across plates/batches. You can see most wells have pretty constrained variation in their distances - and then you get to Batch 6 and it's just completely weird.

I can't see anything out of the ordinary with Batch 5, so it seems to be primarily localized to Batch 6, at least from my view?
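
For anyone wanting to run a similar check, here is a minimal sketch of the analysis as described: the cosine distance between each treated well and the spatially nearest control well on the same plate. It is a reconstruction from the description, not Alex's actual code; the function names and input layout are hypothetical, and it assumes single-letter row names.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def well_position(well):
    """Convert a well name such as 'B03' to zero-based (row, column) indices."""
    return ord(well[0].upper()) - ord("A"), int(well[1:]) - 1


def distance_to_nearest_control(profiles, control_wells):
    """Cosine distance from each treated well to the closest control well on the plate.

    profiles: dict mapping well name -> feature vector, for a single plate
    control_wells: set of well names that hold negative controls
    """
    distances = {}
    for well, features in profiles.items():
        if well in control_wells:
            continue
        position = well_position(well)
        # Nearest control by Euclidean distance on the plate grid.
        nearest = min(
            control_wells,
            key=lambda c: math.dist(position, well_position(c)),
        )
        distances[well] = cosine_distance(features, profiles[nearest])
    return distances
```

Plotting these distances in plate/batch order should make a batch with unusually wide variation stand out.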

Shantanu said

Wow that’s pretty obvious!

Thanks a lot for reporting the details, Alex! This will make its way into GitHub soon

Technical confounders in cpg0016

Hi all:
I noticed that in the ORF dataset (source_4 of cpg0016), the profiles with over 7000 features and the profiles with over 4000 features show batch effects between batches. However, the article analyzed over 1400 features and obtained results without batch effects. In issue #88, you mentioned updating the process of obtaining the features. We are wondering whether the process used to obtain these 1400 features is reliable, and whether the 1400-feature data can be used as reliable data for further analysis.
Looking forward to your answer, thank you!

Clarify why some compounds have multiple replicates

When I'm counting the replicates of each compound in the COMPOUND plates, I have a few questions:

The top ten compounds have >6000 replicates. Among them are DMSO, the empty well (JCP2022_999999), and 8 positive controls. However, when I compare the InChIKey of the 8 positive controls with those given in https://github.com/jump-cellpainting/JUMP-Target/tree/master#positive-control-compounds, one of them disagrees: JCP2022_025848 (GJFCONYVAUNLKB-UHFFFAOYSA-N) has 8127 replicates but is not listed as a positive control; dexamethasone (UREBDLICKHMUKA-CXSFZGCWSA-N) listed as a positive control doesn't appear in the metadata compound.csv.gz.
The 11th-ranked compound, JCP2022_033954, has 1594 replicates. Is it also a positive control, or what is its purpose?
There are many compounds with multiple replicates (for example, more than 10 but fewer than 60). Why do they have many more replicates than the typical case mentioned in the paper (i.e. about 5)?

Thanks!
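
The replicate counts discussed above could be reproduced with something like the sketch below. It assumes the well table has Metadata_Plate and Metadata_JCP2022 columns and that a plate-to-type mapping has been built from the plate table; the function name is hypothetical.

```python
from collections import Counter


def replicate_counts(well_rows, plate_types, wanted="COMPOUND"):
    """Count wells per JCP2022 ID, restricted to plates of one type.

    well_rows: iterable of dicts with 'Metadata_Plate' and 'Metadata_JCP2022'
    plate_types: dict mapping plate name -> plate type
    """
    return Counter(
        row["Metadata_JCP2022"]
        for row in well_rows
        if plate_types.get(row["Metadata_Plate"]) == wanted
    )
```

Calling `.most_common(11)` on the result would list the ten most replicated IDs plus the 11th-ranked one asked about above.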

Are all compounds/genes from cpg0000 replicated in cpg0016?

I see cpg0016 has 119 TARGET2 plates and 4 TARGET1 plates. But when I query by InChIKey there are only 182 compounds in common.

>>> import pandas as pd
>>> compound = pd.read_csv("metadata/compound.csv.gz")
>>> pilot_md = pd.read_csv('JUMP-Target/JUMP-Target-2_compound_metadata.tsv', sep='\t')
>>> len(set(pilot_md['InChIKey']) & set(compound['Metadata_InChIKey']))
182

So it seems that 124 compounds are not being identified. Are the cpg0016 TARGET1/2 plates different from the cpg0000 TARGET1/2 plates?

As for ORFs, I found that only 'HPGDS', 'HRH4', 'KCNJ1', 'KCNN4', 'MME', 'MMP2', 'S1PR2' were present in cpg0000 but missing from cpg0016. Were some of the cpg0016 plates JUMP-Target-ORF plates?
Thanks!

Some JCP labels may be missing in compounds.csv.gz

I am currently downloading metadata and images following #72. I match the metadata from load_data_with_illum.parquet with the contents of well.csv.gz to get the JCP2022 ID, and then double-check that the JCP2022 ID is valid via compound.csv.gz, crispr.csv.gz, and orf.csv.gz. I found that JCP2022_028373 is in well.csv.gz but not in any of compound.csv.gz, crispr.csv.gz, or orf.csv.gz. This is how you can reproduce it:

In [76]: well = pd.read_csv('well.csv.gz')

In [77]: compound = pd.read_csv('compound.csv.gz')

In [78]: crispr = pd.read_csv('crispr.csv.gz')

In [79]: orf = pd.read_csv('orf.csv.gz')

In [80]: all_jcps = pd.concat([x[['Metadata_JCP2022']] for x in [compound, crispr, orf]])

In [81]: joined = well[['Metadata_JCP2022']].drop_duplicates().merge(all_jcps.assign(x=all_jcps.Metadata_JCP2022), 'left')

In [82]: joined[joined.x.isnull()]
Out[82]:
       Metadata_JCP2022    x
83013   JCP2022_UNKNOWN  NaN
135738   JCP2022_028373  NaN

As far as I can tell, all other JCPs in well.csv.gz have a corresponding entry in one of the JCP files. I will treat this like JCP2022_UNKNOWN. Is there any chance that this missing JCP will be added to the metadata in the future?

cpg0016, source 7, sqlite files are of size 0

For source 7, well-level profiles exist, but the SQLite database files are of size 0.

How to get the channel-to-stain mapping per source?

Hello, firstly thank you for sharing all the data, it's an amazing resource to have access to!

Looking through the raw image files on S3, it's hard to be sure which channel each file corresponds to. For example, in sources 3 and 4 the nuclei channel appears to be under images labelled with 'ch5', whereas in source 5 it appears to be 'C01' that corresponds to the nuclei channel.

Are you able to share information like the following:
Source X:
Channel 1 - Nuclei
Channel 2 - AGP
... etc

Thank you very much for your help!
Will
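
Until a per-source mapping is available, the channel index can at least be pulled out of the file names consistently. The sketch below handles the two naming styles that appear in file paths elsewhere on this page; the meaning assigned to each name component (row, column, field, channel) is an assumption based on the names alone, and the function name is hypothetical. It does not say which stain a channel index corresponds to; that still has to come from each source.

```python
import re

# Names like 'r03c04f01p01-ch1sk1fk1fl1.tiff' (seen in sources 1 and 3).
PHENIX_STYLE = re.compile(
    r"r(?P<row>\d+)c(?P<col>\d+)f(?P<field>\d+)p\d+-ch(?P<channel>\d+)"
)
# Names like 'CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif' (seen in source 7).
SOURCE7_STYLE = re.compile(
    r"_(?P<row>[A-Z])(?P<col>\d{2})_T\d+F(?P<field>\d+)L\d+A\d+Z\d+C(?P<channel>\d+)\.tif$"
)


def parse_image_name(name):
    """Extract 1-based row, column, field and channel from a file name, or None."""
    match = PHENIX_STYLE.match(name)
    if match:
        row = int(match["row"])
    else:
        match = SOURCE7_STYLE.search(name)
        if not match:
            return None
        row = ord(match["row"]) - ord("A") + 1  # letter row -> 1-based index
    return {
        "row": row,
        "col": int(match["col"]),
        "field": int(match["field"]),
        "channel": int(match["channel"]),
    }


print(parse_image_name("r03c04f01p01-ch1sk1fk1fl1.tiff"))
print(parse_image_name("CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif"))
```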

question of cpg0012

We didn't find the compound data in cpg0012. Where can we get the compounds?

cpg0016, source 6, sqlite file for one of the plates is of size 0

SQLite file cpg0016-jump/source_6/workspace/backend/p211123CPU2OS48hw384exp036JUMP/110000297123/110000297123.sqlite is of size 0, although well-level profiles exist for this plate.

Provide instructions for downloading images

I’m pretty new to boto3 and mostly following the template in the Jupyter notebook where images are downloaded one-by-one. This mostly seems ok download speed wise if I parallelize with multiple workers, but it seems to have a nasty habit of hanging at times and needing a reset – I’m not sure if I’m being throttled trying to access files systematically like this. Let me know if I am doing something terrible that I should not be doing, and if you’d have any better guidance.

We recommend doing this:

https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted#step-1-download-cell-images
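
Separately from the recommended route above, if you do keep a one-by-one download loop, wrapping each download in retries with backoff tends to make stalls and transient errors less painful. The sketch below is generic: `fetch` is a placeholder for whatever callable you already use to download a single key, and the function names and defaults are hypothetical. It does not address hangs inside a single call, which would need timeouts configured on the client.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(fetch, key, attempts=4, base_delay=1.0):
    """Call fetch(key), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            time.sleep(base_delay * 2 ** attempt)


def download_all(fetch, keys, workers=8, **retry_options):
    """Run fetch over keys in a thread pool; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(with_retries, fetch, k, **retry_options): k for k in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as error:
                failures[key] = error
    return results, failures
```

Keeping the failures in a dict makes it easy to rerun only the keys that did not succeed.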

Report corrupt TIFF files, filter load_data where images are actually missing

I found a few corrupt TIFF files in the JUMP production dataset. So far, I have seen corrupt TIFF files in sources 1 and 7 (4 files each) and in source 3 (1 file). I will report back any additional corrupt TIFF files that I may find during my download/conversion.

Here is what I have so far:

s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

How to confirm that these files are corrupt:

$ urls=(
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement\ 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
)

$ for url in "${urls[@]}"; do aws s3 --no-sign-request cp "$url" .; done
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff to ./r03c04f01p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff to ./r04c18f02p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff to ./r04c19f02p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff to ./r04c37f04p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff to ./r11c22f08p01-ch3sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif to ./CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif to ./CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

$ du -hs *tif *tiff
2.7M    CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
3.1M    r03c04f01p01-ch1sk1fk1fl1.tiff
2.8M    r04c18f02p01-ch4sk1fk1fl1.tiff
3.1M    r04c19f02p01-ch1sk1fk1fl1.tiff
2.6M    r04c37f04p01-ch4sk1fk1fl1.tiff
0       r11c22f08p01-ch3sk1fk1fl1.tiff

$ identify *tif *tiff
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r03c04f01p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c18f02p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c19f02p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c37f04p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Cannot read TIFF header. `r11c22f08p01-ch3sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.

Notes:

Those files seem to have the expected file size (except for the one from source 3), but the magic number is invalid/bad.
I updated the list with 1 corrupt file from source 3
I finished downloading all other sources except 11 and have not found any other corrupt files.
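
The same magic-number check can be done without ImageMagick. A valid TIFF starts with a byte-order mark (`II` or `MM`) followed by the number 42, or 43 for BigTIFF, so reading the first four bytes is enough to catch files like the ones above, including the empty one. This is a minimal sketch with hypothetical function names; it only checks the signature, not the rest of the file.

```python
TIFF_MAGICS = (
    b"II*\x00",  # little-endian classic TIFF (42)
    b"MM\x00*",  # big-endian classic TIFF (42)
    b"II+\x00",  # little-endian BigTIFF (43)
    b"MM\x00+",  # big-endian BigTIFF (43)
)


def has_tiff_header(header: bytes) -> bool:
    """True when the first four bytes look like a TIFF or BigTIFF signature."""
    return header[:4] in TIFF_MAGICS


def find_bad_tiffs(paths):
    """Return the paths whose first four bytes are not a TIFF signature."""
    bad = []
    for path in paths:
        with open(path, "rb") as handle:
            if not has_tiff_header(handle.read(4)):
                bad.append(path)
    return bad
```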

Is gene perturbation metadata currently available? (cpg0016-jump)

Hi Team!

This data looks fantastic, and I can't wait to dig deeper into it. I read the advisory stating that cpg0016 is still a work in progress, as well as seeing the 'First draft of metadata files' bullet under 'What's available now'. From perusing the metadata I found 15,133 unique Metadata_JCP2022 values in the well table that are not in the compound table, so I am assuming that these are the gene perturbations, is that correct? If so, my main question is - is there metadata (either in the form of another table, or individual platemap CSVs) for these conditions yet?

Apologies if the answer is just that it will be coming in future updates, and I'm just being too over-eager.

Thanks a lot!

Best,
Asa Barth-Maron
