
variantspark's Introduction

Variant Spark


variant-spark is a scalable toolkit for genome-wide association studies optimized for GWAS-like datasets.

Machine learning methods and, in particular, random forests (RFs) are promising alternatives to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures to rank SNPs according to their predictive power. Although there are several existing random forest implementations available, some even parallel or distributed such as Random Jungle, ranger, or SparkML, most of them are not optimized to deal with GWAS datasets, which usually come with thousands of samples and millions of variables.

variant-spark currently provides the basic functionality of building a random forest model and estimating variable importance with the mean decrease gini method. The tool can operate on VCF and CSV files. Future extensions will include support for other importance measures, variable selection methods, and data formats.
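To illustrate the mean decrease Gini measure, the toy Python sketch below (an illustration only, not VariantSpark's actual implementation) computes the Gini impurity of a label vector and the impurity decrease of a single split; a variable's mean decrease Gini importance aggregates these decreases (typically weighted by node size) over all nodes that split on it across the forest.

import numpy as np

def gini(y):
    # Gini impurity of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(y, left_mask):
    # Impurity decrease when a node with labels y is split by a boolean mask
    n, n_left = len(y), left_mask.sum()
    return gini(y) - (n_left / n) * gini(y[left_mask]) - ((n - n_left) / n) * gini(y[~left_mask])

y = np.array([0, 0, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False])
print(gini(y), gini_decrease(y, split))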

variant-spark utilizes a novel approach of building random forests from data in transposed representation, which allows it to efficiently deal with even extremely wide GWAS datasets. Moreover, since the most common genomic variant call file format, VCF, already uses this transposed representation, variant-spark can work directly on VCF data without the costly pre-processing required by other tools.
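To show what the transposed representation means here (a toy sketch only, not the actual internal data structures): in a VCF each row is a variant with one genotype per sample, so each row already is the per-variable vector a tree split needs, whereas conventional ML tools expect a samples-by-variants matrix.

import numpy as np

samples = ["S1", "S2", "S3", "S4"]
vcf_rows = {                               # variant -> genotypes across samples (VCF orientation)
    "22_16051249": np.array([0, 1, 2, 0]),
    "22_16051347": np.array([1, 1, 0, 2]),
}
# A conventional ML layout would require building the samples x variants matrix:
X = np.stack(list(vcf_rows.values()), axis=1)   # shape (4 samples, 2 variants)
print(X.shape)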

variant-spark is built on top of Apache Spark – a modern distributed framework for big data processing, which gives variant-spark the ability to scale horizontally on both bespoke clusters and public clouds.

The potential users include:

  • Medical researchers seeking to perform GWAS-like analysis on large cohorts of genome-wide sequencing data or imputed SNP array data.
  • Medical researchers or clinicians seeking to perform clustering on genomic profiles to stratify large-cohort genomic data.
  • General researchers who need to classify or cluster datasets with millions of features.

Community

Please feel free to add issues and/or upvote issues you care about, and join the Gitter chat. We have also started documentation on ReadTheDocs, and there is always this repo's issues page for requests. Thanks for your support.

Learn More

To learn more, watch this video from the HUGO Conference 2020.

There is also an earlier presentation: variant-spark at YOW! Brisbane 2017.

Building

variant-spark requires Java JDK 1.8+ and Maven 3+.

In order to build the binaries use:

mvn clean install

For Python, variant-spark requires Python 3.6+ with pip. The other packages required for development are listed in dev/dev-requirements.txt and can be installed with:

pip install -r dev/dev-requirements.txt

or with:

./dev/py-setup.sh

The complete build including all checks can be run with:

./dev/build.sh

Running

variant-spark requires an existing Spark 3.1+ installation (either local or on a cluster).

To run variant-spark use:

./variant-spark [(--spark|--local) <spark-options>* --] [<command>] <command-options>*

To obtain the list of available commands use:

./variant-spark -h

To obtain help for a specific command (for example importance) use:

./variant-spark importance -h

You can use the --spark marker before the command to pass spark-submit options to variant-spark. The list of Spark options needs to be terminated with --, e.g.:

./variant-spark --spark --master yarn-client --num-executors 32 -- importance ....

Please note that --spark needs to be the first argument of variant-spark.

You can also run variant-spark in --local mode. In this mode, variant-spark ignores any Hadoop or Spark configuration files and runs locally for both Hadoop and Spark. In particular, all file paths are interpreted as local file system paths. Also, any parameters passed after --local and before -- are ignored. For example:

./bin/variant-spark --local -- importance  -if data/chr22_1000.vcf -ff data/chr22-labels.csv -fc 22_16051249 -v -rn 500 -rbs 20 -ro

Note:

The difference between running in --local mode and in --spark mode with a local master is that in the latter case Spark uses the Hadoop filesystem configuration, so the input files need to be copied to that filesystem (e.g. HDFS), and the output will be written to the location determined by the Hadoop filesystem settings. In particular, paths without a scheme, e.g. 'output.csv', will be resolved against the Hadoop default filesystem (usually HDFS). To change this behaviour you can set the default filesystem on the command line using the spark.hadoop.fs.default.name option. For example, to use the local filesystem as the default use:

./bin/variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  ... -of output.csv

You can also use a full URI with a scheme to address any filesystem for both input and output files, e.g.:

./bin/variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  -if hdfs:///user/data/input.csv ... -of output.csv

Running examples

There are multiple ways to run the variant-spark examples.

Manual Examples

variant-spark comes with a few example scripts in the scripts directory that demonstrate how to run its commands on sample data.

There are a few small data sets in the data directory suitable for running on a single machine. For example:

./examples/command-line/local_run-importance-ch22.sh

runs the variable importance command on a small sample of the chromosome 22 VCF file (from the 1000 Genomes Project).

The full-size examples require a cluster environment (the scripts are configured to work with Spark on YARN).

The data required for the examples can be obtained from the data folder https://github.com/aehrc/VariantSpark/tree/master/data

This repository uses the git Large File Storage (LFS) extension, which needs to be installed first (see: https://git-lfs.github.com/)

Clone the variant-spark-data repository and then install the test data into your Hadoop filesystem using:

./install-data

By default, the sample data will be installed into the variant-spark-data/input sub-directory of your HDFS home directory.

You can choose a different location by setting the VS_DATA_DIR environment variable.

After the test data has been successfully copied to HDFS, you can run the example scripts, e.g.:

./examples/command-line/yarn_run-importance-ch22.sh

Note: if you installed the data to a non-default location, VS_DATA_DIR needs to be set accordingly when running the examples.

VariantSpark on the cloud

VariantSpark can easily be used on AWS and Azure. For more examples and information, check the cloud folder; for a quick start, see the pointers below.

AWS Marketplace

VariantSpark is now available on the AWS Marketplace. Please read the Guidelines for specifications and step-by-step instructions.

Azure Databricks

VariantSpark can be easily deployed in Azure Databricks through the button below. Please read the VariantSpark Azure manual for specifications and step-by-step instructions.

Deploy to Azure

Contributions

JsonRfAnalyser

JsonRfAnalyser is a Python program that inspects the JSON random forest model and lists the variables on each tree and branch. Please read its README for the complete list of functionalities.

WebVisualiser

rfview.html is a web page (run locally on your machine) where you can upload the JSON model produced by variant-spark to visualise the trees in the model. You can select which tree to visualise, and node colours and labels can be set to different parameters such as the number of samples in a node or the node impurity. It uses vis.js for tree visualisation.

variantspark's People

Contributors

arashbayatdev, bauerlab, bhosking, dependabot[bot], lynnlangit, piotrszul, plyte, rocreguant, yatish0833


variantspark's Issues

Add scalacheck to the build

Add ScalaCheck to the build (based on the Spark one).
Update the build, travis-ci, and contribution guides.
Make sure all the Scala files conform.

Fix readthedocs build to include pydocs strings.

In the current build of readthedocs, the autogenerated documentation for the Python packages is blank.
I suspect this is because the required dependencies are not available in the readthedocs virtual environment and need to be configured.

Add FAQ to address potential vs-emr install issues for VS on AWS EMR

FAQ for OSX users
Q: Do I need to download the entire source from GitHub to install VS on AWS EMR?
A: No, but it's the simplest way to get the files you need for the install

Q: If I get some kind of permissions error when I attempt to install vs-emr what should I do?
A: Install using sudo -H pip install --user ./python

Q: If I attempt to use vs-emr and I get the command not found error what should I do?
A: Run sudo find / -name "vs-emr" to find the install path and then run vs-emr using the full path, for example sudo /private/var/root/.local/bin/vs-emr

Q: What should I do if I attempt to create an EMR cluster and it fails?
A: Read the error message.
A1: If the error is default config not found, then mkdir ~/.vs_emr, copy the config into that new directory with cp conf/config.min.yaml ~/.vs_emr/config.yaml, then edit config.yaml to replace values as required (do NOT use quotes around values).
A2: If the error is unknown options: --<some parameter>,<some value>, etc..., verify your version of the awscli client tool with aws --version; the minimum supported version is aws-cli 1.10.22. To upgrade awscli run sudo -H pip install awscli --upgrade --user.
TIP: you can attempt to run the generated awscli code directly in your terminal (i.e. without vs-emr) to get more detailed error messages. An example is at the end of this note. You can remove EMR options as needed, depending on your version of awscli.
[screenshot: example of running the generated awscli command directly]

Q: How do I know that the vs-emr command succeeded in creating a cluster?
A: You will see a "ClusterId" value (as shown below). TIP: Remember to immediately submit your analysis job as the cluster is set to auto-terminate.

[screenshot: vs-emr output showing the "ClusterId" value]

Add support for ingesting non-VCF data sources using the Databricks Scala API

At present, only VCF files are exposed through the Scala API for ingesting feature data. It would be useful to allow easy ingestion of Parquet files, as this would broaden the usefulness of VariantSpark beyond genomics. The class ParquetFeatureSource appears to offer this functionality already but is not available within the Databricks environment.

AIR - unbiased gini-based importance score - Algorithm

This procedure provides a Gini-based variable importance method that corrects the bias arising from different numbers of categories (the minor-allele-frequency bias in GWAS) and also shows some promising results regarding correlation issues.

The idea is to create a pseudo variable for each variable in the dataset by permuting the variable's values and adding it to the model, then to run the random forest and subtract the importance of the pseudo variable from that of the original variable (thereby subtracting the bias).

The addition of variables is done only conceptually; in practice no variables are added to the model, saving runtime and memory usage.

A link to the paper:
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty373/4994791

The procedure is implemented in R in ranger, where it is selected with the "impurity_corrected" importance option. Here is an example of how to use it:
ranger(data=data, dependent.variable.name = "y", importance = "impurity_corrected")

The procedure works as follows (a sketch of the candidate-sampling step is given after the list):

  1. Before fitting the RF, a single random permutation of the sample IDs is generated.
  2. Instead of sampling mTry variables from {1,...,p}, we sample mTry variables from {1,...,2p}.
  3. At a given node in a tree we choose the split variable and split value in the regular manner, except that if the sampled index i (from step 2) satisfies 1 <= i <= p we use the variable X_(i) as usual, while if p < i <= 2p we use the pseudo variable X_(i-p)*, i.e. X_(i-p) with the sample IDs permuted as in step 1 (so its values are permuted relative to the labels).
  4. Calculate the mean Gini decrease (Gini importance) I_G as usual for each of the 2p variables. This results in 2p importance scores.
  5. Calculate the new importance score of X_(i) as AIR(X_(i)) = I_G(X_(i)) - I_G(X_(i)*).
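A minimal Python sketch of steps 1-3 (a hypothetical illustration, not VariantSpark's implementation): pseudo variables are never materialised as extra columns, because a sampled index above p is simply served from the original column with the fixed permutation applied to the sample IDs.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.integers(0, 3, size=(n, p))     # toy 0/1/2 genotype matrix
perm = rng.permutation(n)               # step 1: one permutation for the whole forest
m_try = int(np.sqrt(2 * p))             # step 2: sample mTry indices from {1,...,2p}

def node_candidates(sample_idx):
    # Step 3: indices 0..p-1 are the original variables; indices p..2p-1 are the
    # pseudo variables X_(i-p)* (same column, sample IDs permuted), built on the fly.
    chosen = rng.choice(2 * p, size=m_try, replace=False)
    values = {}
    for i in chosen:
        if i < p:
            values[i] = X[sample_idx, i]             # original variable
        else:
            values[i] = X[perm[sample_idx], i - p]   # permuted pseudo variable
    return values

candidates = node_candidates(np.arange(30))          # e.g. the samples reaching one node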

Now we want to produce a null distribution for computing p-values. For that we take all the resulting negative AIR importances and mirror them, creating a set of importances made up of the negative scores and the absolute values of those scores. We use this set of importance scores to compute an empirical cumulative distribution.

We then check where the importance of each variable falls in this distribution; this gives the p-value for that score. (In simpler words, we have a list of scores built from the negatives and the absolute values of the negatives; we order this list, and the p-value of an importance score is its rank in this list divided by the length of the list.)

Since the list resulting from the mirroring (counting all the variables that were not given a score as 0, which can be added to this list) should be very big (roughly the number of variables in the model), we can extract very small p-values from it, which should be enough. If it is still not enough, it might be worth running the procedure twice, thereby doubling the number of scores in the estimated null distribution (in which case averaging the importance scores of the two runs would also make the results more stable).
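A short Python sketch of the mirrored empirical null described above (assumed behaviour, for illustration only): the negative AIR scores and their absolute values form the null sample, and a score's p-value is the fraction of null values at least as large as the score.

import numpy as np

def air_pvalues(air_scores):
    scores = np.asarray(air_scores, dtype=float)
    neg = scores[scores < 0]
    if neg.size == 0:
        return np.ones_like(scores)                  # no negative scores: no null to estimate
    null = np.sort(np.concatenate([neg, -neg]))      # mirror the negative scores
    below = np.searchsorted(null, scores, side="left")
    return (null.size - below) / null.size           # upper-tail empirical p-value

print(air_pvalues([-0.2, -0.05, 0.01, 0.5]))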

I'm happy to answer any questions, as I figure my explanation might not be the best :)

ImportanceAnalysis fails on EMR with 64 core workers and bgz encoded files on S3.

While reading a bgz-encoded VCF file from S3, variant-spark fails with the following exception:

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1237)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:10)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:94)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:39)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:211)
at sun.reflect.GeneratedMethodAccessor330.invoke(Unknown Source)

Optimised tree growing method

I recommend the following improvements to the VariantSpark random forest importance analysis.

  1. Compute and write the importance scores to a file after building every 1000 trees.

  2. Automatically identify when enough trees have been built. If the first suggestion is implemented, we can compare the importance scores at each step (every 1000 trees built) with the importance scores computed in the previous step; if little has changed, we can stop building more trees (see the sketch after this list).

  3. Frequently (every -rbs trees) dump the models (built trees) to disk and allow previously built models to be integrated in a new run. If the process crashes halfway, the models produced so far can be used in the next run.
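One possible shape for the stopping rule in suggestion 2 (purely illustrative; the overlap metric and thresholds are assumptions, not an agreed design): compare the top-ranked variables at consecutive 1000-tree checkpoints and stop once the ranking has stabilised.

def importance_converged(prev_scores, curr_scores, top_k=100, min_overlap=0.95):
    # prev_scores / curr_scores: dicts mapping variable name -> importance at two checkpoints
    top_prev = set(sorted(prev_scores, key=prev_scores.get, reverse=True)[:top_k])
    top_curr = set(sorted(curr_scores, key=curr_scores.get, reverse=True)[:top_k])
    return len(top_prev & top_curr) / top_k >= min_overlap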

FakeFamily improvements

Some ideas on how to improve fake family generation:

  • Add reading all files from HDFS (including ped and spec)
  • Review and improve Hail support for phased genotypes
  • Add one-pass generation of independent mutations (generate mutations for all individuals in one pass and distribute them per individual)
  • Scala/Python API for fake family generation
  • Performance improvements for offspring genotype generation (use an indexed sequence rather than a HashMap); a toy sketch of the core idea follows this list
  • Add mutations based on fasta (an actual sequence file)
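A toy Python sketch of offspring genotype generation from two parents coded as alternate-allele counts (0/1/2), drawing one allele from each parent per variant (an illustration only; the real generator also needs to handle phasing, pedigrees and mutations):

import numpy as np

rng = np.random.default_rng(0)

def offspring(parent1, parent2):
    # Probability of transmitting an alt allele per variant: 0 -> 0.0, 1 -> 0.5, 2 -> 1.0
    def transmit(genotype):
        return (rng.random(genotype.shape) < genotype / 2.0).astype(int)
    return transmit(parent1) + transmit(parent2)

p1 = np.array([0, 1, 2, 1])
p2 = np.array([2, 1, 0, 0])
print(offspring(p1, p2))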

Output file line ending

Currently the output file of the importance analysis ("-of") uses Windows line endings ("\r\n").
I suggest using Linux line endings ("\n") instead, which are compatible with Linux tools such as AWK.

Output trees in JSON format

Implement a readable output, e.g. JSON, for the trained trees. This will help evaluate the resulting interactions and visualise them.
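For illustration, one possible shape for a serialised tree (a hypothetical structure, not a confirmed VariantSpark schema), shown as a Python dict that json.dumps would turn into the JSON output:

import json

tree = {
    "oobError": 0.21,
    "rootNode": {
        "splitVariable": "22_16051249", "splitPoint": 1, "impurityReduction": 0.12, "size": 500,
        "left":  {"majorityLabel": 0, "impurity": 0.10, "size": 320},
        "right": {"majorityLabel": 1, "impurity": 0.22, "size": 180},
    },
}
print(json.dumps(tree, indent=2))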

Add API for Pairwise Operations

Implement the following pairwise operations (a sketch is given after the lists below):

  • Distance metrics (Euclidean and Manhattan)
  • Shared variant counts:
    -- SharedAltCount - counts the number of alternate alleles shared by two genotypes at a variant; can be 0, 1 or 2, with summation over all variants.
    -- AtLeastOneAltShardCount - counts the number of variants with at least one shared alt allele between the two genotypes.

Include in:

  • command line
  • Scala API
  • Hail Integration API
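A minimal Python sketch of the proposed operations on two genotype vectors coded as alternate-allele counts (0/1/2 per variant); the function names mirror the proposal and are illustrative only, not an existing API.

import numpy as np

def manhattan(g1, g2):
    return int(np.abs(g1 - g2).sum())

def euclidean(g1, g2):
    return float(np.sqrt(((g1 - g2) ** 2).sum()))

def shared_alt_count(g1, g2):
    # per variant: number of alt alleles shared by the two genotypes (0, 1 or 2), summed over variants
    return int(np.minimum(g1, g2).sum())

def at_least_one_alt_shared_count(g1, g2):
    # number of variants where both genotypes carry at least one alt allele
    return int(((g1 > 0) & (g2 > 0)).sum())

g1 = np.array([0, 1, 2, 0, 2])
g2 = np.array([1, 1, 0, 0, 2])
print(manhattan(g1, g2), euclidean(g1, g2), shared_alt_count(g1, g2), at_least_one_alt_shared_count(g1, g2))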

NoClassDefFoundError after running VariantSpark (CLI)

After installing the VariantSpark environment (Java, Spark, Scala) and building VariantSpark following the documentation, executing VariantSpark from the terminal results in:

Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ManifestFactory$
at au.csiro.variantspark.cli.VariantSparkApp$.main(VariantSparkApp.scala:23)

This can be caused by not setting the SPARK_HOME environment variable:

export SPARK_HOME=

Using a package manager such as Homebrew may sidestep this problem.

Adding ALT field to output file

In the output file of the importance analysis, each variant is identified by its site, i.e. the combination CHR_POS. However, there are cases where multiple bi-allelic variants appear at the same site. It would be great to add the ALT field to the output file so that each variant is identified as CHR_POS_ALT. This would resolve the ambiguity.

Develop benchmarking datasets

Develop synthetic datasets/models that would allow comparing the 'power', i.e. the ability to detect significant variables, under various conditions (e.g. interactions, noise, correlated variables, etc.), and in particular to compare against traditional GWAS methods like single-locus logistic regression.
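A toy generator for such a benchmark (an assumed design, for illustration only): random genotypes with two additive causal SNPs, one pairwise interaction and label noise via a logistic model.

import numpy as np

rng = np.random.default_rng(42)
n, p = 2000, 10000
X = rng.binomial(2, 0.3, size=(n, p))                    # 0/1/2 genotypes, MAF ~ 0.3
logit = (0.8 * X[:, 0] + 0.8 * X[:, 1]                   # two additive causal SNPs
         + 1.2 * (X[:, 2] > 0) * (X[:, 3] > 0) - 1.5)    # one pairwise interaction
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))        # noisy binary phenotype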

Publish the vs-emr to pypi

Convert it to a proper Python package to avoid all the installation issues, and optionally publish it to PyPI.

Write concise user documentation

Write a concise user manual using Sphinx and make it available at readthedocs.
The audience should be primarily bioinformaticians.
The scope should cover:

  • installing variant-spark
  • running importance analysis with the command line and Python APIs
  • references for available functions and options
  • running on clusters and clouds (AWS, Databricks)
  • a simple developer guide (building, testing)

The proposed outline is here (CSIRO internal only): https://docs.google.com/document/d/1kIZ69VoDTdQhhC0eLBYM_w3v9bG77ZB2lbOYfdXKaWU/edit

The preview of the version committed to the feature branch [ _i76_docs ] is at: https://variantspark.readthedocs.io/en/i76_docs/

Enable running variant spark on AWS EMR

This will include:

  • variant-spark setup on EMR
  • fixing any issues that may arise (like local filesystem writes etc.)
  • scripts to facilitate the creation of a variant-spark enabled EMR cluster
  • scripts to submit variant-spark commands to an EMR cluster
  • demo and documentation
