
Comments (9)

niranjchandrasekaran avatar niranjchandrasekaran commented on July 20, 2024

Thanks @hanslovsky for flagging these. ccing @shntnu to bring this to his attention.

from datasets.

hanslovsky avatar hanslovsky commented on July 20, 2024

I downloaded all sources except source 11 (still working on that) and found only one additional corrupt file in source 3. All other sources (except 11) did not have corrupt files.


shntnu avatar shntnu commented on July 20, 2024

Thank you so much for reporting this @hanslovsky

  • Could you let us know if source_11 had any corrupt files?
  • Do you have any thoughts on how we should report this? I was thinking we could create a new top-level folder https://github.com/jump-cellpainting/datasets/tree/main/errors and a CSV file within for each data component (e.g. images.csv to report missing/corrupt images).


Arkkienkeli avatar Arkkienkeli commented on July 20, 2024

I ran identify on all sources and created a list of all images that this utility reports as corrupted.
If a value for Channel, Well, or Site is missing, the image is not in the metadata. For example, all corrupted images from source_10 in this list are absent from the metadata (this is probably what #61 describes).

@shntnu @hanslovsky

Corrupted_images.csv
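
For anyone who wants to reproduce a scan like this, here is a minimal sketch. It assumes that `identify` is the ImageMagick command-line tool and that a non-zero exit status marks a file as unreadable; the thread does not state the exact command or criteria that were used. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def identify_ok(path: Path) -> bool:
    """Return True if `identify` (assumed: ImageMagick) exits with status 0."""
    result = subprocess.run(
        ["identify", str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_corrupt(
    paths: Iterable[Path],
    check: Callable[[Path], bool] = identify_ok,
) -> List[Path]:
    """Return the paths for which the check fails, in input order."""
    return [p for p in paths if not check(p)]
```

A call such as `find_corrupt(Path("images").rglob("*.tiff"))` would scan a local copy, where `images` is a placeholder directory. The `check` argument is injectable so that a different validator can be swapped in without touching the loop.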


hanslovsky avatar hanslovsky commented on July 20, 2024

@Arkkienkeli your findings are consistent with mine (I did not report any corrupted images that are not in the metadata), with the exception of the one image of source 11. I did not report anything for source 11 in this issue because I was still working on it at that time. I will double-check my records to see if I have any notes on corrupted files for source 11.

I know that I reported missing images for source 11 in #78 but I don't know if that includes any corrupted images.

cc @shntnu


hanslovsky avatar hanslovsky commented on July 20, 2024

@Arkkienkeli I just double-checked the images I reported missing in source 11 (source_11-404.txt) and found the image you reported as corrupted in there as well. I can now say conclusively that our reports are consistent.

Please note that I also found some images in source 11 that were simply not present, in plates EC000038 and EC000066.


shntnu avatar shntnu commented on July 20, 2024

I will drop in some notes for now

$ cat ~/Downloads/source_11-404.txt | cut -d"/" -f6 | sort | uniq -c
6064 EC000038__2021-06-04T17_37_00-Measurement1
   2 EC000066__2021-06-06T12_36_15-Measurement1
   1 EC000070__2021-06-09T23_50_19-Measurement1
   1 failed-paths

$ csvcut -c Source,Batch,Plate ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
      19

$ csvcut -c Source,Batch ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
      15

$ csvcut -c Source ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
       6

$ csvcut -c Source ~/Downloads/Corrupted_images.csv | sort | uniq -c
   5 1
  23 10
   1 11
   1 3
   4 7
   1 Source
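
Note that the `sort | uniq` pipelines above treat the CSV header as a data line, which is why a `1 Source` row appears and why each unique count is one higher than the number of real values. A minimal Python sketch of the same tally that skips the header is shown below. The `Source`, `Batch`, and `Plate` column names come from the `csvcut` calls above; the function name and the sample rows are made up for illustration and are not the contents of Corrupted_images.csv.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Sequence


def count_by(lines: Iterable[str], columns: Sequence[str]) -> Counter:
    """Count data rows per unique combination of the given columns."""
    reader = csv.DictReader(lines)  # consumes the header row
    return Counter(tuple(row[c] for c in columns) for row in reader)


# Made-up sample rows, only to show the call.
sample = io.StringIO(
    "Source,Batch,Plate\n"
    "1,b1,p1\n"
    "1,b1,p2\n"
    "10,b2,p3\n"
)
per_source = count_by(sample, ["Source"])
print(len(per_source))     # number of distinct sources, header excluded
print(per_source[("1",)])  # number of rows for source "1"
```

Passing an open file handle instead of the `StringIO` object works the same way, since `csv.DictReader` accepts any iterable of lines.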

Internal notes

  1. EC000038 on batch2. This plate has the metadata (xml file) and a significant number of images missing. I checked with XXX and she says they are also missing on the microscopy server. Should this be skipped?
  2. To an order of magnitude, how many images are missing: 10, 100, 1000, 10000? I assume that with no Index.idx.xml file you weren't able to run pe2loaddata, but it's pretty trivial to make the load_data and load_data_with_illum CSVs from another plate in the batch with a find-and-replace on the plate name (and by removing the missing files from the load_data CSV, per above). I think that as long as you have at least, say, half the plate still present, there is no reason to throw out this data.
  3. EC000038 on batch2. I checked it and found that we had more than 2000 usable image sets. I copied over the xml file from another plate and processed it.
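
The find-and-replace step described in note 2 could look roughly like the sketch below. It is not the script that was actually used; the function names, the dictionary-per-row representation, and the 0.5 default (taken from the "at least half the plate" rule of thumb) are assumptions for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def clone_load_data(template_rows: List[Row], old_plate: str, new_plate: str) -> List[Row]:
    """Copy load_data rows, replacing the plate name in every field."""
    return [
        {column: value.replace(old_plate, new_plate) for column, value in row.items()}
        for row in template_rows
    ]


def enough_sites(kept: int, expected: int, min_fraction: float = 0.5) -> bool:
    """Rule of thumb from the notes: keep the plate if at least half remains."""
    return expected > 0 and kept / expected >= min_fraction
```

The plate folders in the listing above carry a timestamp suffix (for example `EC000038__2021-06-04T17_37_00-Measurement1`), so path columns would likely need a second replacement on the full folder name, not only on the plate ID.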


shntnu avatar shntnu commented on July 20, 2024

Alright, overall

  • EC000038: the files are missing here because we created the load_data file by hand (see the internal notes in the previous comment). We should edit the load_data file to filter out the sites that have a missing image.
  • EC000066 and EC000070: it turns out these two plates are also among those where we created the load_data file by hand, so we should do the same here.
  • Here's the full list of source_11 plates missing load_data files: EC000038, EC000066, EC000070, EC000156, and EC000157. We should expect similar issues with all of these.
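
A minimal sketch of the filtering step proposed above follows. It assumes that the load_data file pairs `PathName_<channel>` and `FileName_<channel>` columns and that the list of missing images holds full paths in the same form; neither is confirmed in this thread, and the function names are hypothetical.

```python
import posixpath
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]


def image_paths(row: Row) -> List[str]:
    """Join each PathName_<channel> with its FileName_<channel> counterpart."""
    paths = []
    for column, folder in row.items():
        if column.startswith("PathName_"):
            channel = column[len("PathName_"):]
            filename = row.get("FileName_" + channel)
            if filename:
                paths.append(posixpath.join(folder, filename))
    return paths


def drop_incomplete_sites(rows: Iterable[Row], missing: Set[str]) -> List[Row]:
    """Keep only rows (sites) where none of the images is in `missing`."""
    return [row for row in rows if not any(p in missing for p in image_paths(row))]
```

Dropping the whole row when any channel is missing keeps every remaining site complete across channels, which matches the suggestion to filter out sites, not individual files.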

@hanslovsky @Arkkienkeli -- thank you so much for reporting this! You can proceed by simply ignoring these images. Our task is to update the load_data files to remove the discrepancy.


shntnu avatar shntnu commented on July 20, 2024

@Arkkienkeli wrote:

> I did run identify on all sources and created a list of all corrupted images according to this utility. If a value from Channel \ Well \ Site is missing, it means that the image is not in the metadata, for example, all corrupted images in this list from source_10 are actually not in the metadata (probably it is described here #61).
>
> @shntnu @hanslovsky
>
> Corrupted_images.csv

Regarding the corrupted files, we should likely take the same strategy – drop them from load_data. @Arkkienkeli -- You can proceed by ignoring these images because we no longer have access to the originals (thankfully that's only 34 images out of the gazillion)

