image-datapalooza-2023's Introduction

Image Datapalooza 2023

Repository for information advertising and documenting the Image Datapalooza 2023 event to be held at The Ohio State University on August 14-17, 2023.

image-datapalooza-2023's People

Contributors

hlapp

image-datapalooza-2023's Issues

What data are you bringing?

If you have a data set that you are planning to focus on at Image Datapalooza, could you drop a response here explaining a little about the data set and what questions you'd like to be able to answer?

Automatic data curation and feature extraction of museum images of critically endangered mollusks

I'm bringing an image dataset (about 8,000 images) that contains views from two different angles of about half a million specimens of North American freshwater bivalve shells.
Freshwater bivalves are the most endangered animals on the planet, and many of these species have suffered serious population-level declines and range contractions over the last century. The OSU Museum of Biological Diversity Mollusk Division houses the largest freshwater bivalve collection in the world, and holds about a quarter of all known museum specimens of endangered, threatened, and extinct species. We have specimens not only from the majority of watersheds in North America, but in many cases collected from the same sites at multiple time periods. This makes the OSUM Mollusk Division's collection a powerful resource for asking questions about continental-scale changes in phenotype correlated with anthropogenic disturbance (dams, pollution) and climate change.

The dataset consists of images of whole drawers of specimens from two angles -- top down, and 45°. The drawers contain individual boxes of specimens called "lots". One lot is the set of all the specimens of a species collected at a single place and time. Every lot in the collection has a unique numeric catalogue number, which is printed in the top right corner of a cardstock label in the box.
All images were taken using the same lighting setup and contain a Calibrite ColorChecker Nano and a QP Card QP101 Calibration Card with a mm scale bar.

(A sample from the dataset can be downloaded here)

My goal is to get help to use CV / ML methods to:

  1. Segment both images of each drawer of specimens into lots.
  2. Use OCR to capture the catalogue number of each lot from its label and add the number to the image metadata.
  3. Assign GUIDs to the images and make the dataset available online for use in morphological analysis.
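As a rough illustration of steps 2 and 3, here is a minimal Python sketch of turning OCR output into a metadata record with a GUID. It assumes the OCR text has already been produced by some other tool from a crop of the label's top right corner, and that the catalogue number is the first run of digits in that text. The function names, the record fields, and the namespace URL are all hypothetical placeholders, not part of any existing pipeline.

```python
import re
import uuid

# Hypothetical namespace for this collection; any fixed UUID would work.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/osum-mollusks")


def catalogue_number(ocr_text):
    """Return the first run of digits in the OCR text, or None if there is none."""
    match = re.search(r"\d+", ocr_text)
    return match.group(0) if match else None


def lot_record(drawer_id, view, ocr_text):
    """Build a metadata record for one cropped lot image."""
    number = catalogue_number(ocr_text)
    if number is not None:
        # uuid5 is deterministic: re-running on the same lot gives the same GUID.
        guid = uuid.uuid5(NAMESPACE, f"{drawer_id}/{view}/{number}")
    else:
        # No readable number: fall back to a random GUID and leave the
        # catalogue number empty so the record can be reviewed by hand.
        guid = uuid.uuid4()
    return {
        "guid": str(guid),
        "drawer": drawer_id,
        "view": view,
        "catalogue_number": number,
    }
```

Deriving the GUID from the drawer, view, and catalogue number means the identifiers stay stable if the pipeline is re-run; whether that is preferable to random GUIDs is a design choice for the collection.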

I would definitely be interested in testing some hypotheses about the distribution of different morphological traits and color patterns using this dataset. It would be the largest dataset of its kind in existence for mollusks.
Please reach out if you're interested in collaborating on some or all of this!
-Nate

Originally posted by @nfshoobs in #3 (comment)

Anatomic images with associated detailed descriptions from taxonomic treatments dataset

I'd like to work with the Plazi taxonomic treatments dataset, which includes many images with associated anatomical descriptions. However, each image typically contains several subpanels, and likewise the text combines the descriptions for all the subpanels. I'm hoping to separate these into correctly grouped images and descriptions, and further to link the text to taxonomic names and anatomy ontology concepts.
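To make the text side of this concrete, here is a small Python sketch of splitting a combined caption into per-subpanel descriptions. It assumes one particular caption style ("A, dorsal view; B, ventral view; C, head.") with any figure number or species name already stripped from the front; real captions vary widely, so the pattern and the function name are purely illustrative.

```python
import re

# Assumed caption style: "A, dorsal view; B, ventral view; C, head."
# Each panel starts at the beginning of the caption or after a semicolon.
PANEL = re.compile(r"(?:^|;\s*)([A-Z]),\s*([^;]+)")


def split_caption(caption):
    """Map each subpanel label to its description text."""
    return {
        label: text.strip().rstrip(".")
        for label, text in PANEL.findall(caption)
    }
```

The resulting label-to-text mapping could then be paired with the cropped subpanels, provided the panel letters can also be read from the image.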

Originally posted by @balhoff in #3 (comment)

Gather and Process Satellite Data in Real-Time

For Andromeda, we generated tiles surrounding a given location with information on the percentage of various forms of landcover for that particular area. The process we used (through ArcGIS’ API) is recorded here.
Interactions with the ArcGIS server made this very time-consuming and prevented us from collecting the data in real time to match the exact locations at which people took their images. The OSU ASC ESRI-SUPPORT team suggested we look into using PyQGIS as an alternative to ArcPy. (QGIS is an open source GIS application.)
More notes from them:

The question is whether anyone has tried doing something similar, and if so, what did you use? If not, would anyone be interested in trying to piece together a more open-source method that may allow for real-time calculations and access?

The second part of it is access to landcover data at granular levels. We were able to use a layer specific to NJ, but I’d imagine something like it exists for other states as well. The tiles we generated for Andromeda proved useful in the QUEST program’s analysis of pollinators, and it’d be great to see if we could expand it to be available across the country.
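For anyone thinking about an open-source route, the per-tile calculation itself is simple once the landcover classes for a tile are in memory. The Python sketch below shows only that arithmetic, under the assumption that a tile has already been read (with QGIS or any other raster library) into a 2D list of class codes; the function name and the class labels are made up for illustration.

```python
from collections import Counter


def landcover_percentages(tile):
    """Return the percentage of cells per landcover class.

    tile: 2D list of landcover class codes, one per raster cell.
    """
    counts = Counter(code for row in tile for code in row)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty tile: nothing to report
    return {code: 100.0 * n / total for code, n in counts.items()}


# Hypothetical 2x2 tile with made-up class labels.
example = [["forest", "forest"], ["water", "urban"]]
print(landcover_percentages(example))
```

The hard part remains fetching the right raster window around each photo location quickly enough; this only covers what happens after that.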

Matchmaker site between biologist and ML experts

From Scott Rifkin (UCSD):

One outcome that could be very useful would be something like a matchmaker site where a biologist, for example, could post a description of the data and the sort of analysis they have in mind, and ML experts could see if anything interests them to collaborate on. Or the converse. As a biologist I have some image datasets and have in mind some datasets where ML could be a big help, but I don't really know of an efficient way to find a collaborator with the relevant expertise who might also find the particular problem interesting. And I imagine there are ML experts who don't know what datasets are out there, who has them, or which ones might pose exactly the sort of problem they are interested in. Some way to make it easier to find interested and relevant collaborators outside and beyond the few days of the workshop would be a very useful resource.

Data Dashboard for Expedited EDA

There’s a lot of available data, but determining how useful it is (via exploratory data analysis) can be a very time-consuming process. I have been working on a dashboard to visualize distribution information and samples of datasets efficiently and with no coding required. We have a pre-release hosted for use during Image Datapalooza. It is currently a bit Imageomics-focused, though images aren’t required to gather distribution statistics. I think it would be great to expand the functionality for a more general audience as a way to quickly generate visuals for data or explore whether new data would be suitable for experiments without having to invest large amounts of time into EDA.

Consider using Andromeda - an interactive, high-dimensional data exploration tool

As datasets are developed, consider validating and/or exploring your data with a data visualization tool. This tool was originally invented by a team at Virginia Tech, but has now been made available and accessible by an Imageomics team! The tool and a video tutorial can be found at the links below.

Andromeda:
https://andromeda.imageomics.org/

How-to video (very rough, definitely not worthy of any academy awards, haha):
https://drive.google.com/drive/u/0/folders/1x9MLwfHBQv7Tm6K2vOQK-6VNToIHFQyc

I am not sure if this is a project idea... but Andromeda is an EXCELLENT way to engage both professional and novice analysts. As we progress in Datapalooza, what data can we create that are both relatable AND interesting, in that the data have structures (e.g., clustering) worthy of pointing out?
