hail-is / hail

Cloud-native genomic dataframes and batch computing

License: MIT License

R 0.02% Scala 43.85% Python 49.12% Java 0.36% Shell 0.71% HTML 1.26% CSS 0.01% Makefile 0.51% Batchfile 0.06% C++ 1.20% Jupyter Notebook 1.74% XSLT 0.04% Dockerfile 0.12% C 0.01% JavaScript 0.01% SCSS 0.25% Emacs Lisp 0.01% HCL 0.73%

bioinformatics genetics genomics gwas hail python software vcf

hail's Issues

load FASTA files

From @cseed on August 26, 2015 14:3

Copied from original issue: cseed#5

Add functionality to compile TeX sources in docs to PDF

From @alexb-3 on September 30, 2015 20:42

Note that we are using BibTeX with the common bibfile.bib, and that we need a make tool that runs PDFTeX and BibTeX until no errors remain.

Copied from original issue: cseed/hail#68

load .gtf and .reg files

From @cseed on August 26, 2015 14:14

Copied from original issue: cseed#8

support multiallelics

From @cseed on August 26, 2015 14:20

fix representation for variant information
support upstream deletion allele
normalize on import: left-align, split complex

Should be done concurrently with:
https://github.com/cseed/k3/issues/5
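
One part of "normalize on import" can be sketched without a reference genome: trimming bases that the ref and alt alleles share, so each variant has a minimal representation. This is only a Python illustration under stated assumptions, not the project's implementation. The name `trim_alleles` and the 1-based position convention are hypothetical, and true left-alignment through repeats (which needs the reference sequence) is not shown.

```python
def trim_alleles(pos, ref, alt):
    """Return (pos, ref, alt) with shared flanking bases removed.

    Trims the common suffix first, then the common prefix, always leaving
    at least one base in each allele. `pos` is assumed to be 1-based and
    is advanced by the number of prefix bases removed.
    """
    # Trim shared trailing bases.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Trimming the suffix before the prefix keeps the anchor base on the left, which matches the usual VCF convention for insertions and deletions.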

Copied from original issue: cseed/hail#9

compute PCA on subset of variants

From @cseed on August 26, 2015 14:7

Copied from original issue: cseed#6

improve plot integration

From @cseed on September 29, 2015 15:43

Added gqbydp --plot option. Some issues to resolve:

need to test
installDir detection won't work with shadowJar
handle case of output file in Hadoop or Hadoop URI

Copied from original issue: cseed/hail#62

code coverage is insufficient

From @cseed on August 26, 2015 14:46

Depends on:
https://github.com/cseed/k3/issues/4

Copied from original issue: cseed/hail#17

correctly handle temporary files in tests

From @cseed on September 24, 2015 17:16

make sure they get cleaned up
don't write to fixed file names in /tmp
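
Both points can be illustrated with a short Python sketch. The tests in question are Scala suites, so this only shows the pattern, assuming a helper named `write_and_count_lines` that is hypothetical: create a uniquely named temporary directory and let it be deleted automatically when the test finishes.

```python
import os
import tempfile


def write_and_count_lines(lines):
    """Write lines to a uniquely named temp file and count them on read-back."""
    # TemporaryDirectory picks a unique path and deletes it on exit,
    # so tests never collide on a fixed name such as /tmp/out.tsv.
    with tempfile.TemporaryDirectory(prefix="hail-test-") as tmp_dir:
        path = os.path.join(tmp_dir, "out.tsv")
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in lines)
        with open(path) as f:
            count = sum(1 for _ in f)
    # The directory and everything in it have been removed at this point.
    assert not os.path.exists(tmp_dir)
    return count
```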

Copied from original issue: cseed/hail#60

fixed variant type Boolean methods, added variantType method, added tests

From @jbloom22 on August 26, 2015 22:19

Also, rebased.

Copied from original issue: cseed/hail/pull/25

test issue

load BCF files

From @cseed on September 3, 2015 13:47

Copied from original issue: cseed/hail#46

metrics on DP and GQ

From @cseed on August 26, 2015 14:29

Requested by Andrea, "mean DP, standard deviation DP, mean GQ, standard deviation GQ".
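
A possible shape for these metrics, sketched in Python rather than in the project's Spark code: a small accumulator that tracks count, mean, and sum of squared deviations, and that can be merged across partitions. The class name `RunningStats`, the choice of population (not sample) standard deviation, and the skipping of missing values are all assumptions.

```python
import math


class RunningStats:
    """Count, mean, and sum of squared deviations; mergeable across partitions."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        # Missing DP/GQ values (None) are skipped; this is an assumption.
        if x is None:
            return self
        # Welford's single-pass update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        # Pairwise combination of two partial results.
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        return self

    def stdev(self):
        # Population standard deviation; NaN when nothing was observed.
        return math.sqrt(self.m2 / self.n) if self.n else float("nan")
```

One accumulator per sample (or per variant) for DP and another for GQ would yield the four requested numbers.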

Copied from original issue: cseed/hail#12

(vague) sort out python and R integration story

From @cseed on August 26, 2015 16:4

The goal is to let analysts use our methods and data representations coded in Spark within python and R.

Tachyon might be part of the story:

http://tachyon-project.org/

Copied from original issue: cseed/hail#20

sex awareness in computing stats

From @jbloom22 on September 29, 2015 17:15

In computing statistics like sample and variant qc, we should not treat X/Y like the autosomes. A simple solution is to compute only on the autosomes, but we should discuss with the community how to make the tools more powerful via sex awareness. For example, we could split all stats by sex, or split just the sex chromosome stats by sex, ...

Copied from original issue: cseed/hail#64

more robust VCF parser

From @cseed on September 3, 2015 13:49

The other option for this, https://github.com/cseed/k3/issues/3 and https://github.com/cseed/k3/issues/46, is to use Hadoop-SAM and htsjdk.

Copied from original issue: cseed/hail#47

Create framework for testing tsv output files

From @jbloom22 on September 30, 2015 20:21

Currently in MendelErrorsSuite we only test the internal representation, and not the correctness of the output files. The same issue arises whenever we support write but not read. We should come up with a strategy for testing in such cases.

Copied from original issue: cseed/hail#67

automated performance testing

From @cseed on August 26, 2015 14:51

Waiting on suitable machines (Intel spark cluster, cloud access, etc.)

measure size of stored data
compressed vs uncompressed (gzip parquet, lz4 in SparkyVSM, etc.)
compute cost (or at least compute-hrs)
compare best-case (e.g. gzip -cd file.vcf.gz | wc -l vs LoadVCF)

Copied from original issue: cseed/hail#18

Added GQByDPBins. ManagedVSM.flatMapWithKeys not implemented (likely to be deleted soon).

From @cseed on August 26, 2015 21:36

Copied from original issue: cseed/hail/pull/24

can't use assert for input validation

From @cseed on September 1, 2015 15:48

We need a general strategy for error reporting. We can't use asserts for input data validation (e.g., reading tsv files, command line flags, etc.). We need to generate good error messages with feedback about line numbers, etc.
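
As a rough illustration of the idea, in Python rather than the project's Scala: validation failures raise a dedicated exception whose message names the source and the line number, instead of tripping an assert. `InputError`, `parse_sample_table`, and the two-column file layout are hypothetical names and assumptions made for the sketch.

```python
class InputError(Exception):
    """Raised for malformed user input; the message carries source and line."""


def parse_sample_table(lines, source="<input>"):
    """Parse 'sample<TAB>value' rows into a dict, reporting errors by line."""
    result = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise InputError(
                f"{source}:{line_no}: expected 2 tab-separated fields, "
                f"found {len(fields)}"
            )
        sample, raw = fields
        try:
            result[sample] = float(raw)
        except ValueError:
            # Report the offending text rather than a bare stack trace.
            raise InputError(
                f"{source}:{line_no}: invalid number {raw!r}"
            ) from None
    return result
```

Unlike an assert, this check cannot be compiled out or disabled, and the caller can catch `InputError` to print the message and exit cleanly.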

Copied from original issue: cseed/hail#42

isSNP, etc. in Variant are wrong

From @cseed on August 26, 2015 13:48

Copied from original issue: cseed#1

TestNGSuite "assert" is not serializable

From @tpoterba on September 23, 2015 15:26

It would be nice to be able to map assert in RDDs. This is impossible right now because assertions are not serializable.

Copied from original issue: cseed/hail#52

(vague) generate synthetic data for method verification

From @cseed on September 1, 2015 15:16

We should be on the lookout for an opportunity to write methods to generate synthetic data for method verification. Mixed logistic regression might be a good example.

Copied from original issue: cseed/hail#41

import VCF header

From @cseed on August 26, 2015 14:43

Store entire header. Parse contig at least. Expand sampleIds: Array[String] to store metadata.
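
The "parse contig at least" step might look roughly like the Python sketch below. It assumes the standard `##contig=<ID=...,length=...>` header form and that no value contains a quoted comma. The function name `parse_contigs` is hypothetical.

```python
def parse_contigs(header_lines):
    """Map contig ID -> length (or None) from '##contig=<...>' header lines."""
    contigs = {}
    for line in header_lines:
        line = line.strip()
        if not (line.startswith("##contig=<") and line.endswith(">")):
            continue  # not a contig line; a full importer would still store it
        body = line[len("##contig=<"):-1]
        # Assumption: no quoted values containing commas.
        fields = dict(kv.split("=", 1) for kv in body.split(",") if "=" in kv)
        if "ID" in fields:
            length = fields.get("length")
            contigs[fields["ID"]] = int(length) if length is not None else None
    return contigs
```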

Copied from original issue: cseed/hail#16

build abstractions for reading and writing tsv files

From @cseed on August 31, 2015 22:34

Copied from original issue: cseed/hail#39

decide on documentation format

From @cseed on August 28, 2015 15:57

What format is the documentation going to be? We may need more than one system.

Reference on what methods are computing. This needs to include mathematical equations.
Tutorial on how to use k3.
Documentation for k3 developers (how to set up k3, git workflow, etc.). Markdown should be fine for this.
Use Scaladoc to document APIs used by developers against our codebase.

Copied from original issue: cseed/hail#31

Linear regression

From @jbloom22 on October 14, 2015 1:23

Cotton, I haven't implemented good testing yet, but would appreciate some initial feedback on the code as I take a break to write up the math. I've done some comparison with R and it checks out so far. It runs on the command line, but you'll need to create a covariate file from your PCs.

Copied from original issue: cseed/hail/pull/81

DSL for filtering samples and variants

From @cseed on August 26, 2015 14:37

When designing, consider plinkseq --mask option:

https://atgu.mgh.harvard.edu/plinkseq/masks.shtml

and bcftools filter expressions:

https://samtools.github.io/bcftools/bcftools.html#expressions

Copied from original issue: cseed/hail#15

compute HWE

From @cseed on August 26, 2015 14:27

Copied from original issue: cseed/hail#11

documentation

From @cseed on August 28, 2015 15:57

We need to document k3 for users as a mathematician would, that is to say, precisely. It should include precise mathematical definitions for everything we can compute that can be cut-and-pasted into plots of our output.

Depends on https://github.com/cseed/k3/issues/31.

Copied from original issue: cseed/hail#32

create a VCF to interval list method

From @jbloom22 on September 23, 2015 21:35

See the VCF methods here:
http://broadinstitute.github.io/picard/index.html

Copied from original issue: cseed/hail#54

native transformation to split multiallelic variants

From @cseed on August 26, 2015 13:50

The goal is to not rely on bcftools during import. Need to think through PL, which bcftools doesn't get right.

Copied from original issue: cseed#2

add sex check

From @cseed on August 28, 2015 15:51

If you're not Alex and you're working on this, talk to Alex. He has some ideas about how this should be done.

Copied from original issue: cseed/hail#30

set up profiling

From @cseed on August 26, 2015 14:56

including for Spark jobs. Consider YourKit:

https://www.yourkit.com/

Copied from original issue: cseed/hail#19

compute HWE using exact test

From @alexb-3 on August 26, 2015 16:10

Copied from original issue: cseed/hail#21

per-site linear regression

From @cseed on August 26, 2015 14:11

Output should at least include p-values.

Copied from original issue: cseed#7

compute Mendelian violations

From @cseed on August 26, 2015 14:23

Requires loading .ped/.fam files.
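
A minimal Python sketch of the two pieces involved, under stated assumptions: the first four whitespace-separated columns of a .ped/.fam line are family ID, individual ID, father ID, and mother ID, with `0` marking an unknown parent; genotypes are alt-allele counts (0, 1, 2) at a biallelic autosomal site. The names `parse_fam`, `TRANSMITTABLE`, and `is_mendel_error` are hypothetical, and sex chromosomes and missing genotypes are not handled.

```python
def parse_fam(lines):
    """Return (child, father, mother) ID trios where both parents are given."""
    trios = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        _family, child, father, mother = fields[:4]
        if father != "0" and mother != "0":
            trios.append((child, father, mother))
    return trios


# Alt-allele counts that a parent with genotype 0, 1, or 2 can transmit.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendel_error(child, father, mother):
    """True if the child's alt-allele count cannot arise from the parents."""
    possible = {
        f + m
        for f in TRANSMITTABLE[father]
        for m in TRANSMITTABLE[mother]
    }
    return child not in possible
```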

Copied from original issue: cseed/hail#10

create BGEN parser

From @cseed on September 22, 2015 16:16

See:

http://www.well.ox.ac.uk/~gav/bgen_format/
http://www.well.ox.ac.uk/~gav/bgen_format/bgen_format_v1.2.html

Copied from original issue: cseed/hail#50

do family-based association testing/TDT

From @cseed on August 31, 2015 22:22

Compare plink:

http://pngu.mgh.harvard.edu/~purcell/plink/fanal.shtml#tdt

Copied from original issue: cseed/hail#37

efficiently load compressed VCFs

From @cseed on August 26, 2015 13:52

See here:

https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/examples/src/main/java/org/seqdoop/hadoop_bam/examples/TestVCF.java

This allows us to stop using gatling.

Copied from original issue: cseed#3

export VariantDataset as VCF

From @cseed on September 22, 2015 20:35

Copied from original issue: cseed/hail#51

disable Spark INFO output by default

From @cseed on September 24, 2015 16:26

Copied from original issue: cseed/hail#56

% of variants with GQ > 20 per DP bins

From @cseed on August 26, 2015 14:29

Requested by Andrea.
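
One way to read this request, sketched in Python: group genotype calls into fixed-width DP bins and report the percentage in each bin with GQ above the threshold. The bin width of 5, the function name, and the choice to skip calls with missing DP or GQ are assumptions, not part of the request.

```python
from collections import defaultdict


def pct_gq_above_by_dp_bin(calls, gq_threshold=20, bin_width=5):
    """calls: iterable of (dp, gq) pairs. Returns {bin_start: percent}."""
    totals = defaultdict(int)
    passing = defaultdict(int)
    for dp, gq in calls:
        if dp is None or gq is None:
            continue  # missing values are skipped (assumption)
        start = (dp // bin_width) * bin_width  # e.g. DP 7 -> bin starting at 5
        totals[start] += 1
        if gq > gq_threshold:
            passing[start] += 1
    return {b: 100.0 * passing[b] / totals[b] for b in sorted(totals)}
```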

Copied from original issue: cseed/hail#13

create tool for finding de novo variants in vcf containing trios

From @jbloom22 on September 1, 2015 18:20

Kaitlin has written a python tool that can serve as a model. From Kaitlin:

The most recent version is attached (3.93). The biggest issue that I've yet to resolve is how to handle multi-allelic lines above tri-allelic. Gets into nightmare territory quickly.

To run this, you'll need three things:

VCF of interest
PED file for the families in the VCF
The ESP variant counts file that I made (.gz for the moment since it is so large)
You can find this file here: /humgen/atgu1/fs03/wip/kaitlin/all_ESP_counts_5.28.13.txt

The command line argument should look like this:
python de_novo_finder_3.py all_ESP_counts_5.28.13.txt

I suggest specifying an output file. There are a few optional flags that you can use to adjust things in the script.

-v, --annotatevariants_VEP: If you have VEP annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type

-t, --thresh: The PL threshold set for the next most likely genotype in the child. This gets at how confident you want the het call to be in the child. Default is a PL threshold of 20, but you can adjust that up or down if you'd like.

-c, --minchildAB: I require that the heterozygous child has at least 20% alternative reads. You can adjust that with this.

-d, --depthratio: I require that the depth of coverage in the child is at least 1/10th that of the combined parental depth. You can adjust that with this (integer input for the 1/x).

-m, --prob(dn)metric: The minimum p(DN) that you will accept. I have it set at 0.05 and you could adjust it up. Due to the validation likelihoods that were added, you won't get anything below 0.05.

-p, --maxparentAB: I require that the parents, who should both be homozygous reference, have no more than 5% alternative reads. You can adjust that with this flag.

-a, --annotatevariants: If you have SnpEff annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type
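
The read-level filters described by these flags can be sketched as follows. This is not Kaitlin's script, only a Python illustration of the stated defaults. The `Call` type, the function name, treating depth as ref reads plus alt reads, and reading the PL rule as "het has PL 0 and the next most likely genotype has PL at least the threshold" are all assumptions. The p(DN) cutoff is left out because its calculation is not described here.

```python
from dataclasses import dataclass


@dataclass
class Call:
    ref_reads: int
    alt_reads: int
    pl: tuple  # (hom-ref, het, hom-alt) phred-scaled likelihoods

    @property
    def depth(self):
        return self.ref_reads + self.alt_reads

    @property
    def alt_fraction(self):
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_de_novo_filters(child, father, mother, thresh=20,
                           min_child_ab=0.20, depth_ratio=10,
                           max_parent_ab=0.05):
    """Apply the default thresholds from the flag descriptions above."""
    # Child must be a confident het: the next most likely genotype needs
    # a PL of at least `thresh`.
    if child.pl[1] != 0 or min(child.pl[0], child.pl[2]) < thresh:
        return False
    # Child needs at least 20% alternative reads.
    if child.alt_fraction < min_child_ab:
        return False
    # Child depth must be at least 1/depth_ratio of combined parental depth.
    if child.depth < (father.depth + mother.depth) / depth_ratio:
        return False
    # Each parent may have no more than 5% alternative reads.
    return all(p.alt_fraction <= max_parent_ab for p in (father, mother))
```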

Copied from original issue: cseed/hail#44