rrrlw / tdastats Goto Github PK

R pipeline for computing persistent homology in topological data analysis. See https://doi.org/10.21105/joss.00860 for more details.

Home Page: https://rrrlw.github.io/TDAstats

License: GNU General Public License v3.0

R 65.44% C++ 31.38% TeX 3.17%

topological-data-analysis homology-calculations cran ripser ggplot2 pipeline r persistent-homology tda homology topology homology-computation visualization data-science r-package joss r-packages topology-visualization

tdastats's Introduction

TDAstats: topological data analysis in R

Overview

TDAstats is an R pipeline for computing persistent homology in topological data analysis.

Installation

To install TDAstats, run the following R code:

# install from CRAN
install.packages("TDAstats")

# install development version from GitHub
devtools::install_github("rrrlw/TDAstats")

# install development version with vignettes/tutorials
devtools::install_github("rrrlw/TDAstats", build_vignettes = TRUE)

Sample code

The following sample code loads two synthetic datasets bundled with the package, then calculates and visualizes their persistent homology to showcase the use of TDAstats.

# load TDAstats
library("TDAstats")

# load sample datasets
data("unif2d")
data("circle2d")

# calculate persistent homology for both datasets
unif.phom <- calculate_homology(unif2d, dim = 1)
circ.phom <- calculate_homology(circle2d, dim = 1)

# visualize first dataset as persistence diagram
plot_persist(unif.phom)

# visualize second dataset as topological barcode
plot_barcode(circ.phom)

A more detailed tutorial can be found in the package vignettes or at this Gist.

Functionality

TDAstats has 3 primary goals:

Calculation of persistent homology: the C++ Ripser project is a lightweight library for calculating persistent homology that outpaces all of its competitors. Given the importance of computational efficiency, TDAstats naturally uses Ripser behind the scenes for homology calculations, using the Rcpp package to integrate the C++ code into an R pipeline (Ripser for R).
Statistical inference of persistent homology: persistent homology can be used in hypothesis testing to compare the topological structure of two point clouds. TDAstats uses a permutation test in conjunction with the Wasserstein metric for nonparametric statistical inference.
Visualization of persistent homology: persistent homology is visualized using two types of plots - persistence diagrams and topological barcodes. TDAstats provides implementations of both plot types using the ggplot2 framework. Having ggplot2 underlying the plots confers many advantages to the user, including generation of publication-quality plots and customization using the ggplot object returned by TDAstats.
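
The permutation test named in the second goal can be sketched in base R. This is a minimal illustration under stated assumptions, not the TDAstats implementation: `perm_test_sketch` and `toy_distance` are hypothetical names, and the toy statistic (difference in mean nearest-neighbor distance) only stands in for a distance between persistence diagrams.

```r
# Hypothetical stand-in for a distance between two topological summaries:
# absolute difference in mean nearest-neighbor distance of two point clouds.
toy_distance <- function(a, b) {
  mean_nn <- function(x) {
    d <- as.matrix(dist(x))
    diag(d) <- Inf                 # ignore each point's zero distance to itself
    mean(apply(d, 1, min))
  }
  abs(mean_nn(a) - mean_nn(b))
}

# Generic permutation test: shuffle group labels, recompute the statistic,
# and report the fraction of shuffles at least as extreme as the observed one.
perm_test_sketch <- function(a, b, dist_fun, iterations = 200) {
  observed <- dist_fun(a, b)
  pooled <- rbind(a, b)
  n_a <- nrow(a)
  null_vals <- replicate(iterations, {
    idx <- sample(nrow(pooled))
    dist_fun(pooled[idx[seq_len(n_a)], , drop = FALSE],
             pooled[idx[-seq_len(n_a)], , drop = FALSE])
  })
  list(observed = observed,
       null_vals = null_vals,
       p_value = mean(null_vals >= observed))
}

set.seed(1)
square <- matrix(runif(100), ncol = 2)        # 50 uniform points
angles <- runif(50, 0, 2 * pi)
circle <- cbind(cos(angles), sin(angles))     # 50 points on a unit circle
result <- perm_test_sketch(square, circle, toy_distance)
result$p_value
```

Passing the distance as a function argument keeps the resampling logic independent of whichever metric is used to compare the two datasets.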

Contribute

To contribute to TDAstats, you can create issues for any bugs/suggestions on the issues page. You can also fork the TDAstats repository and create pull requests to add features you think will be useful for users.

Citation

If you use TDAstats, please consider citing the following (based on use):

General use of TDAstats: Wadhwa RR, Williamson DFK, Dhawan A, Scott JG. TDAstats: R pipeline for computing persistent homology in topological data analysis. Journal of Open Source Software. 2018; 3(28): 860. doi: 10.21105/joss.00860
TDAstats to calculate persistent homology (Ripser): Bauer U. Ripser: Efficient computation of Vietoris-Rips persistence barcodes. 2019; arXiv: 1908.02518.
TDAstats to perform statistical test: Robinson A, Turner K. Hypothesis testing for topological data analysis. J Appl Comput Topol. 2017; 1: 241.

Real-world applications, use cases, and mentions

Stenseke J. Persistent homology and the shape of evolutionary games. Journal of Theoretical Biology. 2021; 531: 110903. Link to paper.
Torres-Espin A, Haefeli J, Ehsanian R, et al. Topological network analysis of patient similarity for precision management of acute blood pressure in spinal cord injury. eLife. 2021; 10: e68015. Link to paper.
Somasundaram E, Litzler A, Wadhwa R, Owen S, Scott J. Persistent homology of tumor CT scans is associated with survival in lung cancer. Medical Physics. 2021; 48(11): 7043-7051. Link to paper and preprint.
Richardson M, Verma R, Singhania A, Tabone O, Das M, Rodrigue M, Leissner P, Woltmann G, Cooper A, O'Garra A, Haldar P. Blood transcriptional phenotypes of progressive latent M. tuberculosis infection inform novel signatures that improve prediction of tuberculosis risk. Cell Reports Medicine. 2021. Link to paper.
Perez-Moraga R, Fores-Martos J, Suay-Garcia B, Duval J-L, Falco A, Climent J. A COVID-19 Drug Repurposing Strategy through Quantitative Homological Similarities Using a Topological Data Analysis-Based Framework. Pharmaceutics. 2021; 13(4): 488. Link to paper.
Kandanaarachchi S, Hyndman RJ. Leave-one-out kernel density estimates for outlier detection. Monash University. 2021. Link to paper.
Somasundaram EV, Brown SE, Litzler A, Scott JG, Wadhwa RR. Benchmarking R packages for calculation of persistent homology. R Journal. 2021; 13(1): 184-193. Link to paper.
Brochard A, Blaszczyszyn B, Mallat S, Zhang S. Particle gradient descent model for point process generation. 2020. arXiv:2010.14928. Link to preprint.
Nguyen DQN, Xing L, Lin L. Community detection, pattern recognition, and hypergraph-based learning: approaches using metric geometry and persistent homology. 2020. arXiv:2010.00435. Link to preprint.
Pinto GVF. Motivic constructions on graphs and networks with stability results. Doctoral Thesis: Universidade Estadual Paulista Rio Claro & Ohio State University. 2020. Link to thesis.
Gommel M. A Machine Learning Exploration of Topological Data Analysis Applied to Low and High Dimensional fMRI Data. Doctoral Thesis: University of Iowa. 2019. doi: 10.17077/etd.005247. Link to thesis.
Mémoli F, Singhal K. A Primer on Persistent Homology of Finite Metric Spaces. Bulletin of Mathematical Biology. 2019; 81(7): 2074. Links to paper and preprint
Srinivasan R, Chander A. Understanding Bias in Datasets using Topological Data Analysis. Fujitsu Laboratories of America. 2019. Link
Kough D, Neuzil M, Simpson C, Glover R. Analyzing State of the Union Addresses using Topology. University of St. Thomas. 2019. Link
Rickert J. A Mathematician's Perspective on Topological Data Analysis and R. 2018. Link
Blog post on Data Management
Analyzing finance data
R package for visualizing persistent homology


tdastats's Issues

calculate_homology results

Hi! When I use calculate_homology on a graph with 7 vertices (for example), I obtain only 6 features at dimension 0, all starting at filtration weight 0. Why is that? Shouldn't there be 7 features? I couldn't find the reason in your guidelines or vignettes.
Thanks!
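
One possible explanation, offered as an assumption rather than a maintainer's answer: in a Vietoris-Rips filtration every point is its own connected component at scale 0, each merge of two components ends one dimension-0 feature, and the one component that never dies has no finite death value. An output restricted to finite intervals would therefore list n - 1 features. The base R sketch below illustrates the count with single-linkage clustering, whose merge heights coincide with the dimension-0 death values; it does not call TDAstats.

```r
set.seed(42)
pts <- matrix(runif(14), ncol = 2)      # 7 random points in the plane

# Single-linkage clustering merges the two closest components at each step.
merge_heights <- hclust(dist(pts), method = "single")$height

# Each merge ends one component, so 7 points give 6 finite (birth = 0, death) pairs.
dim0 <- cbind(dimension = 0, birth = 0, death = merge_heights)
nrow(dim0)
```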

allow `calculate_homology` to return data frames

Currently, TDAstats::calculate_homology only returns a matrix. When converted to a data frame with as.data.frame, users need to convert the dimension column to a factor (instead of numeric) to accurately plot colors in barcode (allow discrete colors instead of quantitative color spectrum). Easy fix would be to add parameter that returns a properly formatted data frame to users so that they don't have to do any extra steps.
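
The manual conversion described above can be sketched as follows. The small hand-made matrix stands in for real `calculate_homology` output, and the column names `dimension`, `birth`, and `death` are an assumption about that output's layout.

```r
# Hand-made stand-in for a persistent homology matrix (values are illustrative).
phom <- matrix(c(0, 0.0, 0.12,
                 0, 0.0, 0.31,
                 1, 0.4, 0.95),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("dimension", "birth", "death")))

phom_df <- as.data.frame(phom)

# A factor makes ggplot2 choose discrete colors instead of a continuous gradient.
phom_df$dimension <- factor(phom_df$dimension)

str(phom_df)
```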

Change plot_barcode x-axis label

The current label is Vietoris-Rips diameter, but the user could be plotting output from a Cech or cubical complex; change it to simplicial complex diameter.

move over calculate_homology functionality to ripserr

move after ripserr stable version is on CRAN

Persistent homology

Dear All,

Please, how do I read data from an external file to obtain its persistent homology?

The file has 4 columns: x y z. It is in a certain directory, let's say /home/linda/Destop/file.

Thank you!
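
A minimal sketch of one way to do this, assuming the file is whitespace-delimited with a header row and that `calculate_homology` accepts a numeric matrix with one point per row. The temporary file stands in for the real path so the example is self-contained, and the TDAstats call runs only if the package is installed.

```r
# Write a small stand-in file; replace `path` with the location of the real file.
path <- tempfile(fileext = ".txt")
set.seed(7)
write.table(data.frame(x = runif(20), y = runif(20), z = runif(20)),
            file = path, row.names = FALSE, quote = FALSE)

# Read the coordinates and convert them to a numeric matrix.
coords <- as.matrix(read.table(path, header = TRUE))

if (requireNamespace("TDAstats", quietly = TRUE)) {
  phom <- TDAstats::calculate_homology(coords, dim = 1)
  head(phom)
}
```

If the file carries an extra non-coordinate column, subset the matrix first, for example `coords[, c("x", "y", "z")]`.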

depend on R v3.3 instead of v3.4?

Since i use Mac OS X 10.9 on my laptop, i only have R version 3.3.2. As an experiment, i cloned this repo, changed the dependency to R (>= 3.3), and installed using devtools::install(). It worked fine, and i was able to work through all of the examples. Are there specific reasons for requiring version 3.4?

phom.dist does not calculate wasserstein distance

Need to adjust documentation to reflect above fact. Potentially add parameter in permutation test function to allow user to pick their own distance function (takes persistent homology of two datasets as parameters, returns numeric).

Thanks to @kisungyou for bringing this to my attention.

Add stop points in wrapped C++ code

Otherwise RStudio's "Stop" button won't work properly.

Return persistent homology as data frame instead of matrix

Data frames are better for visualization (using the grammar of graphics system in ggplot2); should be a quick switch to return as a data frame instead of matrix.

update contribution guidelines + code of conduct

based on contributor covenant

Support for other field coefficients

This is more of a question.

Is it in the scope of this package to re-adjust the ripser source code to support other coefficients in a prime field?

As you know, this should simply involve replacing areas like e.g. here with the code compiled when the USE_COEFFICIENTS preprocessor variable is enabled in the ripser package.

I imagine usage like

TDAstats::calculate_homology(unif2d, dim=1, p=11)

or something

Suggestion for improving the vignettes

Hi,
We can improve the vignettes.
Honestly, there are several parts of the vignettes that I don't understand.
What is @roadmap-ph in the vignette "Introduction to persistent homology with TDAstats"?
It would help newcomers like me if you explained what @roadmap-ph is.
I have listed the parts I don't understand below.

The parts I don't understand

The parts I don't understand in "Introduction to persistent homology with TDAstats"

@roadmap-ph
[@Rcpp-paper]
[@ggplot2-book]

The parts I don't understand in "Hypothesis testing with TDAstats"

@resampling-book
@hyptest
[@wasserstein-calc]
@resampling-book

By the way, the references are missing from the vignettes, although each vignette has a references section at the bottom.
You should either list the references or remove those sections.
Thanks!

fixed coordinates for persistence plot?

Since persistence plots implicitly rely on a 1:1 aspect ratio for visual interpretability, would it be appropriate to add + ggplot2::coord_fixed(ratio = 1) to the 'ggplot' object returned by plot_persist()? (Certainly lower- and higher-persistence features are discriminable regardless of the aspect ratio, but for professional publications this has, to my knowledge, been the rule. And, like the other plot specs, this could be overridden by the user.)

Part of this JOSS review.

inspection methods for test output

Test output, e.g. from permutation_test(), currently returns a somewhat unwieldy list. I think the following methods would be useful to implement:

print() (see various methods for {stats} test output for inspiration)
summary(), possibly
tidy() and glance() from {generics}, as used in {broom} and its extensions
autoplot(), e.g. histograms of null samples to illustrate p-values

However, this requires that the tests return objects of some S3 class. This could be a new class or classes, or the existing 'htest' class---or possibly both, in case some tasks can be dispatched to 'htest' methods but others must be more specific.
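
As a rough sketch of the 'htest' option, the list below mimics a test result and is given that class so the print method from {stats} applies. The input fields (`wasserstein`, `pvalue`) and the helper name `as_htest_sketch` are assumptions for illustration, not the package's confirmed structure.

```r
# Hypothetical converter from a plain list to an object of class 'htest'.
as_htest_sketch <- function(res, data_name) {
  structure(
    list(statistic = c(distance = res$wasserstein),
         p.value   = res$pvalue,
         method    = "Permutation test on persistent homology (sketch)",
         data.name = data_name),
    class = "htest"
  )
}

# Made-up numbers standing in for one dimension of a test result.
fake_result <- list(wasserstein = 0.42, pvalue = 0.03)
ht <- as_htest_sketch(fake_result, "cloud1 and cloud2")
print(ht)   # dispatches to the 'htest' print method from {stats}
```

Reusing 'htest' gives print() for free; methods such as autoplot() for null samples would still need a more specific class carrying the permutation values.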

stat and geom layers

Would it make sense to supplement the plot_*() shorthands with stat_*() and geom_*() layers that perform the transformation and visualization duties separately? This would allow users to

produce plots using recognizable and (more) customizable ggplot2 syntax
render alternative visualizations of birth–death PH data
visualize data generated by other means (e.g. via construction of a Čech complex) in persistence and barcode diagrams

Since the plot_*() functions don't require a specific class of data frames (just recognizable column names), this could, i think, be done without any changes to current functionality.

diagonal missing from modified persistence plot

The following code chunk doesn't produce an error, but fails (for me) to produce the diagonal line in the persistence plot (i had trouble using reprex::reprex()):

library(TDAstats)
data(circle2d)
circ.phom <- calculate_homology(circle2d, dim = 1)
plot_persist(circ.phom) + ggplot2::xlim(c(0, .25))

Is the problem on my end? It prevents me from fully performing the visual comparison suggested in the "inference" vignette. I will try to reproduce it on a different machine tomorrow.

Part of this JOSS review.
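
One possible cause, stated as an assumption about how the diagonal is drawn: `ggplot2::xlim()` sets scale limits, and ggplot2 replaces values outside scale limits with `NA`, so a line segment whose endpoint falls beyond 0.25 would be dropped. `coord_cartesian()` zooms without discarding data. The sketch below shows the difference on a stand-in plot, not on `plot_persist()` output, and runs only if ggplot2 is installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  pts <- data.frame(birth = c(0.05, 0.10), death = c(0.20, 0.15))
  diag_line <- data.frame(x = 0, y = 0, xend = 1, yend = 1)  # stand-in diagonal

  base <- ggplot(pts, aes(birth, death)) +
    geom_point() +
    geom_segment(data = diag_line,
                 aes(x = x, y = y, xend = xend, yend = yend),
                 inherit.aes = FALSE)

  # Scale limits: the segment's end (xend = 1) lies outside [0, 0.25],
  # so the segment is removed when the plot is drawn.
  clipped <- base + xlim(c(0, 0.25))

  # Coordinate limits: the view is zoomed and the segment stays in the plot.
  zoomed <- base + coord_cartesian(xlim = c(0, 0.25))
}
```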

limit of the filtration

How do I set a limit for the filtration value? For example, say I want to grow my simplices up to radius 5.
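
A hedged sketch: released versions of TDAstats document a `threshold` argument to `calculate_homology` (the maximum diameter of the Vietoris-Rips complex), but check `?calculate_homology` for the installed version. Treating "radius 5" as a diameter of 10 is an assumption about the convention intended in the question. The TDAstats call runs only if the package is installed and exposes that argument.

```r
set.seed(3)
cloud <- matrix(runif(60, 0, 20), ncol = 2)   # 30 points in a 20 x 20 square

max_radius   <- 5
max_diameter <- 2 * max_radius                # balls of radius 5 meet at distance 10

# Only pairs at most `max_diameter` apart can form edges below the limit.
n_edges <- sum(dist(cloud) <= max_diameter)

if (requireNamespace("TDAstats", quietly = TRUE) &&
    "threshold" %in% names(formals(TDAstats::calculate_homology))) {
  phom <- TDAstats::calculate_homology(cloud, dim = 1, threshold = max_diameter)
  head(phom)
}
```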

real-world illustration

The toy examples provided with the package are helpful and wholly appropriate. The package would also benefit, i think, from an illustration using real-world data (as suggested but not required by the JOSS review checklist). Is there a dataset you've used for this purpose that could be included in the package and demonstrated either in the functional documentation or in a separate vignette? This isn't a sticking point for my review, but i do think the package would achieve its aims more effectively with such a case study.

Part of this JOSS review.

Distance measure used by phom.dist()

Hello! I am using the phom.dist() function to compute the distance between persistence diagrams. Can you clarify what distance measure is being computed by this function? Is there a reference/citation/source for the distance measure being computed? I was under the impression phom.dist() returned the Wasserstein distance based on the function naming, but looking at a previous issue (#13) I see that that isn't the case.

Thanks!
