imageomics / image-datapalooza-2023

Repository for the Image Datapalooza 2023 event held at OSU in August 2023.

License: Creative Commons Zero v1.0 Universal

image-datapalooza-2023's Introduction

Image Datapalooza 2023

Repository for information advertising and documenting the Image Datapalooza 2023 event to be held at The Ohio State University on August 14-17, 2023.

image-datapalooza-2023's People

Contributors

Stargazers

Watchers

Forkers

douglasmbura luis-gonzalez-m

image-datapalooza-2023's Issues

What data are you bringing?

If you have a data set that you are planning to focus on at Image Datapalooza, could you drop a response here explaining a little about the data set and what questions you'd like to be able to answer?

Automatic data curation and feature extraction of museum images of critically endangered mollusks

I'm bringing an image dataset (about 8,000 images) that contains views from two different angles of about half a million specimens of North American freshwater bivalve shells.
Freshwater bivalves are the most endangered animals on the planet, and many of these species have suffered serious population-level declines and range contractions over the course of the last century. The OSU Museum of Biological Diversity Mollusk Division houses the largest freshwater bivalve collection in the world, and furthermore contains about a quarter of all known museum specimens of endangered, threatened, and extinct species. We have specimens not only from the majority of watersheds in North America, but in many cases from the same sites collected at multiple different time periods. This makes the OSUM Mollusk Division's collection a very powerful resource for asking questions about continental-scale changes in phenotype correlated with anthropogenic disturbance (dams, pollution) and climate change.

The dataset consists of images of whole drawers of specimens from two angles: top-down and 45°. The drawers contain individual boxes of specimens called "lots". One lot is the set of all the specimens of a species collected at a single place and time. All lots in the collection have a unique numeric catalogue number, which is printed in the top right corner of a cardstock label in the box.
All images were taken using the same lighting setup and contain a Calibrite ColorChecker Nano and a QP Card QP101 Calibration Card with a mm scale bar.

(A sample from the dataset can be downloaded here)

My goal is to get help to use CV / ML methods to:

Segment both images of each drawer of specimens into lots.
Use OCR to capture the catalogue number of each lot from its label and add the number to the image metadata.
Assign GUIDs to the images and make the dataset available online for use in morphological analysis.
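
The three goals above imply a per-lot metadata record that ties a cropped image to its catalogue number and a GUID. The following is only a minimal sketch of that bookkeeping, using the Python standard library and assuming the segmentation and OCR steps have already run. The function names, the namespace URL, the file name, the sample OCR text, and the "first run of digits" rule are all hypothetical placeholders, not part of the dataset or any existing pipeline.

```python
import re
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical namespace for deterministic GUIDs; a real project would choose its own.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/mollusk-lots")


@dataclass
class LotRecord:
    guid: str
    drawer_image: str                # source drawer photo (hypothetical file name)
    view: str                        # e.g. "top" or "45deg"
    catalogue_number: Optional[int]  # None when OCR found no digits


def parse_catalogue_number(ocr_text: str) -> Optional[int]:
    """Take the first run of digits in raw OCR text (an assumed label layout)."""
    match = re.search(r"\d+", ocr_text)
    return int(match.group()) if match else None


def make_lot_record(drawer_image: str, view: str, lot_index: int, ocr_text: str) -> LotRecord:
    """Build the metadata record for one segmented lot."""
    # uuid5 is deterministic: the same drawer, view, and lot index always yield the same GUID.
    guid = uuid.uuid5(NAMESPACE, f"{drawer_image}/{view}/{lot_index}")
    return LotRecord(str(guid), drawer_image, view, parse_catalogue_number(ocr_text))


# Made-up example input standing in for real OCR output.
record = make_lot_record("drawer_0012_top.jpg", "top", 3, "OSUM 45123\nOhio R.")
print(record)
```

Deterministic GUIDs make reruns of the pipeline reproducible; random identifiers (`uuid.uuid4`) would be the alternative if identifiers should not be derivable from file names.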

I would definitely be interested in testing some hypotheses about the distribution of different morphological traits and color patterns using this dataset. It would be the largest dataset of its kind in existence for mollusks.
Please reach out if you're interested in collaborating on some or all of this!
-Nate

Originally posted by @nfshoobs in #3 (comment)

Anatomic images with associated detailed descriptions from taxonomic treatments dataset

I'd like to work with the Plazi taxonomic treatments dataset, which includes many images with associated anatomical descriptions. However, each image typically contains several subpanels, and the text likewise combines the descriptions for all of the subpanels. I'm hoping to separate these into correctly grouped images and descriptions, and further to link the text to taxonomic names and anatomy ontology concepts.

Originally posted by @balhoff in #3 (comment)

Gather and Process Satellite Data in Real-Time

For Andromeda, we generated tiles surrounding a given location with information on the percentage of various forms of landcover for that particular area. The process utilized (through ArcGIS’ API) is recorded here.
Challenges interacting with the ArcGIS server made this a very time-consuming process and prevented us from collecting the data in real time to match the exact location at which people took their images. The OSU ASC ESRI-SUPPORT team suggested we look into using PyQGIS as an alternative to ArcPy. (QGIS is an open-source GIS application.)
More notes from them:

There is some specific information in the Introduction of the PyQGIS Developer Cookbook about using PyQGIS in standalone scripts and custom applications.
Sections 6 (Using Vector Layers), 7 (Geometry Handling) and 8 (Projections Support) have functions that might replace the geometry and projection functions from ArcPy.
Different types of spatial analysis are accomplished in PyQGIS (and QGIS generally) using processing algorithms. The main QGIS documentation has a section dedicated to processing algorithms for different operations. There is one for overlap analysis, which “calculates the area and percentage cover by which features from an input layer are overlapped by features from a selection of overlay layers,” and this is essentially what Tabulate Intersection does in ArcGIS. (Tabulate Intersection was essentially what we were using.)
The support contact did not think Python libraries like Shapely and GeoPandas had anything that mirrored Tabulate Intersection.
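
The arithmetic behind the overlap analysis described in the notes above (area and percentage cover of a tile by each land-cover class) can be illustrated with a small standard-library sketch. This is not the QGIS algorithm or Tabulate Intersection: axis-aligned rectangles stand in for real polygons, and the class names and coordinates are invented for illustration. Real land-cover layers would need true polygon intersection from a GIS library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A box is (xmin, ymin, xmax, ymax) in some projected coordinate system.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes; the result has zero area when they do not intersect."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def percent_cover(tile: Box, landcover: List[Tuple[str, Box]]) -> Dict[str, float]:
    """Percentage of the tile covered by each land-cover class.

    Assumes features within one class do not overlap each other.
    """
    tile_area = box_area(tile)
    if tile_area == 0:
        raise ValueError("tile has zero area")
    covered: Dict[str, float] = defaultdict(float)
    for cls, box in landcover:
        covered[cls] += box_area(intersection(tile, box))
    return {cls: 100.0 * area / tile_area for cls, area in covered.items()}


# Invented example: a 10 x 10 tile and three land-cover features.
tile = (0.0, 0.0, 10.0, 10.0)
features = [
    ("forest", (0.0, 0.0, 5.0, 10.0)),    # fully inside the tile
    ("water", (5.0, 0.0, 12.0, 4.0)),     # partly outside the tile
    ("urban", (20.0, 20.0, 30.0, 30.0)),  # no overlap with the tile
]
cover = percent_cover(tile, features)
print(cover)
```

The per-class sum is what a tabulated intersection reports for one zone; repeating it per tile gives the table of land-cover percentages described for Andromeda.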

The question is whether anyone has tried something similar and, if so, what you used. If not, would anyone be interested in trying to piece together a more open-source method of doing this that may allow for real-time calculations and access?

The second part of it is access to landcover data at granular levels. We were able to use a layer specific to NJ, but I’d imagine something like it exists for other states as well. The tiles we generated for Andromeda proved useful in the QUEST program’s analysis of pollinators, and it’d be great to see if we could expand it to be available across the country.

Matchmaker site between biologist and ML experts

From Scott Rifkin (UCSD):

One outcome that could be very useful would be something like a matchmaker site where a biologist, for example, could post a description of the data and the sort of analysis they have in mind, and ML experts could see if anything interests them to collaborate on. Or the converse. As a biologist I have some image datasets, and have in mind others where ML could be a big help, but I don't really know of an efficient way to find a collaborator with the relevant expertise who might also find the particular problem interesting. And I imagine there are ML experts who don't know what datasets are floating around out there, or who has them, that might pose exactly the sort of problem they are interested in. Some way to make it easier to find interested and relevant collaborators beyond the few days of the workshop would be a very useful resource.

Data Dashboard for Expedited EDA

There’s a lot of available data, but determining its level of usefulness (via exploratory data analysis) can be a very time-consuming process. I have been working on a dashboard to visualize distribution information and samples of datasets efficiently and with no coding required. We have a pre-release hosted for use during Image Datapalooza. It is currently a bit Imageomics-focused, though images aren’t required to gather distribution statistics. I think it would be great to expand the functionality for a more general audience, as a way to quickly generate visuals for data or explore whether new data would be suitable for experiments without having to invest large amounts of time in EDA.

Consider using and contributing to biotaxa - Visualising the growth and prediction of taxonomic diversity and identifying taxonomic incompleteness and imprecision

Biotaxa visualizes the growth of taxonomic diversity and computes taxonomic completeness/imprecision in the history of biodiversity discovery.

The default data was retrieved from the Register of Antarctic Species, but the R code can potentially be applied to taxonomic data of all repositories.
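
One simple reading of "growth of taxonomic diversity" is an accumulation curve: the cumulative number of taxa described up to each year. The sketch below shows only that calculation in Python with made-up years; it does not use or reproduce biotaxa's R functions, and the function name is hypothetical.

```python
from collections import Counter
from itertools import accumulate
from typing import Iterable, List, Tuple


def accumulation_curve(description_years: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (year, cumulative number of taxa described up to and including that year)."""
    per_year = Counter(description_years)
    years = sorted(per_year)
    totals = accumulate(per_year[year] for year in years)
    return list(zip(years, totals))


# Made-up years of first description for six taxa.
curve = accumulation_curve([1901, 1901, 1950, 1975, 1975, 1975])
print(curve)  # [(1901, 2), (1950, 3), (1975, 6)]
```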

Consider using Andromeda - an interactive, high-dimensional data exploration tool

As datasets are developed, consider validating and/or exploring your data with a data visualization tool. This tool was originally invented by a team at Virginia Tech, but has now been made available and accessible by an Imageomics team! The tool and a video tutorial can be found at the links below.

Andromeda:
https://andromeda.imageomics.org/

How-to video (very rough, definitely not worthy of any academy awards, haha):
https://drive.google.com/drive/u/0/folders/1x9MLwfHBQv7Tm6K2vOQK-6VNToIHFQyc

I am not sure if this is a project idea... but Andromeda is an EXCELLENT way to engage both professional and novice analysts. As we progress in Datapalooza, what data can we create that are both relatable AND interesting, in that the data have structures (e.g., clustering) worth pointing out?
