
lostruct-py

This is a Python reimplementation of the original lostruct R package, written by Joseph Guhlin with assistance from Peter Ralph.


Demonstration

Please see the Example Notebook

Installation

Lostruct-py is available on PyPI; `pip install lostruct` is the easiest way to get started.

Usage

Input Files

Inputs should be a set of markers in BCF or VCF format. Both should be indexed as appropriate (see: bcftools index). Filtering before running this analysis is strongly suggested (Allele frequency, SNPs only, missingness, etc).

Citing

If you use this version, please cite it via Zenodo DOI: 10.5281/zenodo.3997106 as well as the original paper describing the method:

Li, Han, and Peter Ralph. "Local PCA shows how the effect of population structure differs along the genome." Genetics 211.1 (2019): 289-304.

CyVCF2

This project also uses cyvcf2 for fast VCF processing and should be cited:

Brent S Pedersen, Aaron R Quinlan, cyvcf2: fast, flexible variant analysis with Python, Bioinformatics, Volume 33, Issue 12, 15 June 2017, Pages 1867–1869, https://doi.org/10.1093/bioinformatics/btx057

Changes from Lostruct R package

Please note numpy and R are different when it comes to row-major vs. column-major. Essentially, many things in the python version will be transposed from R.
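As a toy illustration of the transposition (plain numpy, not lostruct code): filling a matrix with the same sequence of values column-by-column (R's default) versus row-by-row (numpy's default) yields transposed results.

```python
import numpy as np

# R fills matrices column-by-column (column-major); numpy fills row-by-row
# (row-major, C order). The same sequence of values therefore produces
# transposed matrices in the two languages.
r_style = np.array([1, 2, 3, 4, 5, 6]).reshape((2, 3), order="F")  # like R's matrix()
py_style = np.array([1, 2, 3, 4, 5, 6]).reshape((2, 3))            # numpy default
print(np.array_equal(r_style, py_style))                           # False
print(np.array_equal(r_style, np.arange(1, 7).reshape((3, 2)).T))  # True
```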

Requirements

Python >= 3.6 (may work with older versions). Developed on Python 3.8.5

  • numba
  • numpy
  • cyvcf2

CyVCF2 requires zlib-dev, libbz2-dev, libcurl-dev, and liblzma-dev; numba requires libllvm. The Python dependencies may be installed with conda or pip, e.g. by running pip install -r requirements.txt.

Changes

See CHANGES.MD for the full list.

0.0.5 (pending)

  • Smallest circle
  • Better in-built JAX support

0.0.4

  • Package name changed to lostruct
  • Parallelization of get_pc_dists
  • Implementation of fastmath parameter for get_pc_dists

Tests

Tests were derived from the Medicago HapMap data. The Python package and the R package have a high correlation of values; if the values begin to deviate, these tests will fail.

To run tests simply do:

```shell
pip install pytest
pytest --benchmark-disable tests/test_fns.py
```

The tests furthermore require unittest and scikit-bio (and pytest to run them this way).

Benchmarks

To run tests with benchmarks, install the following:

```shell
pip install pytest-benchmark
```

Then run `pytest`.

TOX

Tox allows you to run the tests with multiple versions of the Python interpreter in virtualenvs. It is best to use pyenv to install multiple Python versions and run tox before submitting pull requests, to be certain the tests complete successfully across all versions.

To run tests via tox:

```shell
pip install tox
```

Correlation Data

To test the correlation of results between the R and Python versions, we used data from the Medicago HapMap project, specifically SNPs for sister taxa, chromosome 1, processed and run with lostruct R.

Data

```shell
bcftools annotate chr1-filtered-set-2014Apr15.bcf -x INFO,FORMAT | bcftools view -a -i 'F_MISSING<=0.2' | bcftools view -q 0.05 -q 0.95 -m2 -M2 -a -Oz -o chr1-filtered.vcf.gz
```

Lostruct Processing

```shell
Rscript run_lostruct.R -t SNP -s 95 -k 10 -m 10 -i data/
```

Run 21 Aug 2020, using lostruct R git hash: 444b8c64bebdf7cdd0323e7735ccadddfc1c8989

This generates the mds_coords.tsv that is used in the correlation comparison. Additionally, the existing tests cover correlation.

To test the weight generation, a random sample of weights was created and used. Output was moved to lostruct-results/weights_mds_coords.csv and generated with the random_weights.txt found in test_data/random_weights.txt.

```shell
./run_lostruct.R -i data -t snp -s 95 -k 10 -m 10 -w random_weights.txt
```

FAQ / Notes

Future

Currently the end-user is expected to save the outputs, but it would be good to save them in a similar way to the lostruct R code. Please open an issue if you need this.

Feature Completeness with R implementation

We are not yet feature complete with the R implementation. If something is needed please check for existing issues and comment about your need.

PCA, MDS, PCoA

PCoA returns the same results as lostruct's MDS implementation (cmdscale). In the example Jupyter notebook you can see the correlation is R =~ 0.998. Some examples of other methods of clustering / looking at differences are included in the notebook.

Speed and Memory

Numba and cyvcf2 are used to speed up processing, and the software is multithreaded by default. The Sparse library is used to reduce memory requirements. The parse_vcf function is multithreaded; the distance calculation is not.

tl;dr of below

Two options are offered below: fastmath for the get_pc_dists function, and method="fsvd" for pcoa. Using both gives a performance increase and a lower memory requirement. Accuracy decreases slightly, but the absolute correlation we see with our test dataset remains ~0.998. Be aware that when using fsvd the sign of the correlation may change.

JAX

JAX is an open-source library with optional support in lostruct. It allows for processing on GPU and TPU (untested here, so far); see JAX. You can enable it in the function get_pc_dists by setting jax=True. This results in XX% speedup. Fastmath (below) and JAX are both supported, although JAX outperforms fastmath.

Fastmath

Additionally, a mode implementing Numba's "fastmath" is available: for the function get_pc_dists, set fastmath=True. This results in a ~8% speed boost with very little change in the final output (correlation to the R code output remains >= 0.995). This was benchmarked on the Medicago data used in the Jupyter notebook using timeit, with 100 repeats each for fastmath=False and fastmath=True: `get_pc_dists(result, fastmath=True)`

The difference between fastmath=True and leaving it off can be seen here. Note: downloading the file will allow you to see more detailed information, as some JavaScript is contained in the SVG but disabled on GitHub.

If you need to limit thread usage, please see Numba's guide.
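For example, one way to cap Numba's thread pool is the NUMBA_NUM_THREADS environment variable (a sketch; see Numba's threading documentation for the full set of options):

```python
import os

# NUMBA_NUM_THREADS must be set before numba is first imported;
# numba.set_num_threads() can lower the count afterwards at runtime.
os.environ["NUMBA_NUM_THREADS"] = "4"
print(os.environ["NUMBA_NUM_THREADS"])
```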

Very Large Datasets

The R implementation handles very large datasets in less memory. The problem arises with the PCoA function. A metric MDS using sklearn may work. Another alternative would be to export the data and run cmdscale in R directly.
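Exporting the distance matrix for R could look like this (a sketch; the file name is a placeholder, and pc_dists stands in for the real matrix from get_pc_dists):

```python
import numpy as np

# Write the pairwise distance matrix to CSV so R can read it with
# read.csv() and run cmdscale() on it directly.
pc_dists = np.array([[0.0, 1.0, 2.0],
                     [1.0, 0.0, 1.0],
                     [2.0, 1.0, 0.0]])
np.savetxt("pc_dists.csv", pc_dists, delimiter=",")
print(np.loadtxt("pc_dists.csv", delimiter=",").shape)  # (3, 3)
```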

The sklearn MDS function differs from the scikit-bio function, here we focus on the scikit-bio version.

There are several options in Python for this as well: pcoa(method="fsvd", ...) reduces memory use and increases speed, at the cost of some accuracy.

pcoa(inplace=True, ...) Centers a distance matrix in-place, further reducing memory requirements.

pcoa(number_of_dimensions=10) Returns only the first 10 dimensions (configurable) of the scaling. This has no real effect if method is the default (or manually set to "eigh"), as all eigenvalues and eigenvectors are calculated anyway and this becomes a simple truncation.
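Why eigh-based truncation saves no work can be seen in a plain-numpy sketch of classical PCoA (this is an illustration, not scikit-bio's actual code; pcoa_truncated is a hypothetical helper):

```python
import numpy as np

# Double-center a squared distance matrix (the classical MDS / PCoA step),
# eigendecompose it, then keep only the top k dimensions. Note that
# np.linalg.eigh computes ALL eigenpairs first; k only trims the result.
def pcoa_truncated(d, k):
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                  # Gower-centered matrix
    vals, vecs = np.linalg.eigh(b)               # full decomposition...
    order = np.argsort(vals)[::-1][:k]           # ...then truncated to k
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords = pcoa_truncated(d, 2)
print(coords.shape)  # (4, 2)
```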

You can see the difference between method="fsvd" and method="eigh" (default) here. These are tested with a minimum of 50 rounds. Note: Downloading the file will allow you to see more detailed information, as some javascript is contained in the SVG but disabled on GitHub.

Using all three techniques, correlation is maintained although the sign may change.
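The sign change is harmless: the orientation of an MDS axis is arbitrary, so the correlation magnitude is what matters when comparing runs (a toy numpy illustration, not lostruct output):

```python
import numpy as np

# With fsvd the recovered axis may come back flipped: identical
# coordinates, opposite sign. Compare |r|, not the raw sign.
mds1 = np.array([0.1, -0.3, 0.7, -0.5])
mds1_flipped = -mds1
r = np.corrcoef(mds1, mds1_flipped)[0][1]
print(round(r, 6))  # -1.0
```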

```python
mds = pcoa(pc_dists, method="fsvd", inplace=True, number_of_dimensions=10)
np.corrcoef(mds.samples["PC1"], mds_coords['MDS1'].to_numpy())[0][1]
# -0.9978147088087447
```

For more information please see the applicable documentation as well as the relevant changelog. A Zenodo entry is also available on this topic.

References

Additional citations can be found in CITATIONS (UMAP, PHATE, Medicago HapMap).

Miscellaneous Notes

We are using the code formatter Black. Also, code coverage is actually 100%, but Numba JIT'd code is not properly counted; as long as all tests complete, everything is working.

lostruct-py's People

Contributors

alxsimon, jguhlin, petrelharp


lostruct-py's Issues

document tests

I'm having a look at the tests. Very nice! A few things would help do a better double-check of the correctness/completeness:

  • somewhere (the README?) it should say how to run the tests (install tox; run tox)

  • I'm having trouble with actually running it - I've got python3.7, so at first tox complained that it couldn't find py38; so I edited tox.ini to be py37, and then it runs but says No module named 'skbio', despite this being installed. I'm more familiar with the unittest framework, and would be happy to shift over these tests to that framework, if you don't have a strong opinion?

  • I see the tests are mostly comparing to pre-calculated values; could you document where those values came from (ie, the code that was run to produce it)? E.g., in a comment next to the value, but we should also provide an R script to produce the values (especially, lostruct-results/mds_coords.csv).

TODO: Weights

Implement weights as well as testing with pre-generated random weights (only generate random weights once, so they remain stable for testing)

result = np.vstack(result) returns ValueError

I am trying lostruct-py on an Anopheles gambiae 1000 Genomes VCF file. I followed the lostruct example notebook.

The step result = np.vstack(result) raises a ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.

If I look at the content of the list result, it returns:

[(masked_array(
    data=[[0.024377151620136803, 0.0055554613461685715,
           0.010918887311073908, ..., -0.010985846369178787,
           0.0009770968159092588, -0.006400975680106911],
          [0.0055554613461685715, 0.017436751370230114,
           0.002998137131115064, ..., -0.007354442130630891,
           0.020027221589469815, -0.004796715421154812],
          [0.010918887311073908, 0.002998137131115064,
           0.018562483188029614, ..., -0.010009942403568376,
           -0.001372045844136154, -0.005033924953068051],
          ...,
          [-0.010985846369178787, -0.007354442130630891,
           -0.010009942403568376, ..., 0.022317520802542085,
           -0.007211201425648945, 0.004266296593160091],
          [0.0009770968159092588, 0.020027221589469815,
           -0.001372045844136154, ..., -0.007211201425648945,
           0.04005025065896817, -0.004584271829876065],
          [-0.006400975680106911, -0.004796715421154812,
           -0.005033924953068051, ..., 0.004266296593160091,
           -0.004584271829876065, 0.03756765517972615]],
    mask=[[False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          ...,
          [False, False, False, ..., False, False, False],
...
          ...,
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False],
          [False, False, False, ..., False, False, False]],
    fill_value=1e+20))]

What is displayed is only one window (output is truncated by VS Code). In my opinion, I only need the data in that list to run the pc_dist command, not the mask and fill_value.
Lostruct_example_output.txt

name: lostruct?

If you agree, I think this might as well be called lostruct instead of lostructpy, although if you're wanting to distribute it I'd like to be a co-maintainer?
