hail-is / hail

Cloud-native genomic dataframes and batch computing

Home Page: https://hail.is

License: MIT License

R 0.02% Scala 43.85% Python 49.12% Java 0.36% Shell 0.71% HTML 1.26% CSS 0.01% Makefile 0.51% Batchfile 0.06% C++ 1.20% Jupyter Notebook 1.74% XSLT 0.04% Dockerfile 0.12% C 0.01% JavaScript 0.01% SCSS 0.25% Emacs Lisp 0.01% HCL 0.73%
bioinformatics genetics genomics gwas hail python software vcf

hail's Issues

improve plot integration

From @cseed on September 29, 2015 15:43

Added gqbydp --plot option. Some issues to resolve:

  • need to test
  • installDir detection won't work with shadowJar
  • handle case of output file in Hadoop or Hadoop URI

Copied from original issue: cseed/hail#62

metrics on DP and GQ

From @cseed on August 26, 2015 14:29

Requested by Andrea, "mean DP, standard deviation DP, mean GQ, standard deviation GQ".
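A minimal sketch of the requested summary, assuming per-genotype DP and GQ values are already available as plain numbers with `None` for missing calls. The function name `summarize` and the sample values are hypothetical, and the use of the population (not sample) standard deviation is an assumption, since the request does not say which is wanted.

```python
import math
from typing import Iterable, Optional, Tuple


def summarize(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Return (mean, population standard deviation), skipping missing values."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        if v is None:  # missing DP or GQ for this genotype
            continue
        n += 1
        total += v
        total_sq += v * v
    if n == 0:
        return (math.nan, math.nan)
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return (mean, math.sqrt(variance))


# Hypothetical DP values for one sample; the same call would work for GQ.
dp_mean, dp_sd = summarize([10, 20, None, 30])
```

Keeping only a count, a sum, and a sum of squares means partial results from different partitions can be combined by simple addition, which is one reason this form may suit a distributed setting.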

Copied from original issue: cseed/hail#12

sex awareness in computing stats

From @jbloom22 on September 29, 2015 17:15

In computing statistics like sample and variant QC, we should not treat X/Y like the autosomes. A simple solution is to compute only on the autosomes, but we should discuss with the community how to make the tools more powerful via sex awareness. For example, we could split all stats by sex, or just sex chromosome stats by sex, ...

Copied from original issue: cseed/hail#64

Create framework for testing tsv output files

From @jbloom22 on September 30, 2015 20:21

Currently in MendelErrorsSuite we only test the internal representation, and not the correctness of the output files. The same issue arises whenever we support write but not read. We should come up with a strategy for testing in such cases.
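One possible strategy, sketched here in Python rather than in the project's test suite: pair each writer with a small test-only reader, write to a temporary file, parse it back, and compare against the expected header and rows. `write_tsv`, `read_tsv_back`, and the column names are hypothetical stand-ins, not existing code.

```python
import csv
import os
import tempfile


def write_tsv(path, header, rows):
    """Stand-in for the writer under test (hypothetical)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_back(path):
    """Test-only reader: parse the file into a header and rows of strings."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        return header, [row for row in reader]


# Hypothetical columns and values, used only to exercise the round trip.
header = ["fam_id", "sample", "n_errors"]
rows = [["F1", "S1", "3"], ["F2", "S2", "0"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.tsv")
    write_tsv(path, header, rows)
    got_header, got_rows = read_tsv_back(path)
```

The test then asserts that `got_header` and `got_rows` match what was written, so the file on disk is checked and not just the internal representation.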

Copied from original issue: cseed/hail#67

automated performance testing

From @cseed on August 26, 2015 14:51

Waiting on suitable machines (Intel spark cluster, cloud access, etc.)

  • measure size of stored data
  • compressed vs uncompressed (gzip parquet, lz4 in SparkyVSM, etc.)
  • compute cost (or at least compute-hrs)
  • compare best-case (e.g. gzip -cd file.vcf.gz | wc -l vs LoadVCF)

Copied from original issue: cseed/hail#18

can't use assert for input validation

From @cseed on September 1, 2015 15:48

Need a general strategy for error reporting. We can't use asserts for input data validation (e.g., reading tsv files, command line flags, etc.). We need to generate good error messages with feedback about line numbers, etc.
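As an illustration of the kind of message this asks for, here is a minimal Python sketch that raises a dedicated exception carrying the file name and line number instead of failing an assert. `InputError`, `parse_tsv_line`, `read_rows`, and the file name are hypothetical.

```python
class InputError(Exception):
    """User-facing error for malformed input (hypothetical name)."""


def parse_tsv_line(line, line_number, path, n_fields):
    """Split one TSV line, reporting file and line number on failure."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_fields:
        raise InputError(
            f"{path}:{line_number}: expected {n_fields} fields, "
            f"found {len(fields)}"
        )
    return fields


def read_rows(lines, path="<input>", n_fields=2):
    """Parse all lines, numbering them from 1 as an editor would."""
    return [
        parse_tsv_line(line, number, path, n_fields)
        for number, line in enumerate(lines, start=1)
    ]


try:
    read_rows(["a\tb\n", "c\n"], path="samples.tsv")
    message = None
except InputError as err:
    message = str(err)
```

Unlike an assert, the exception cannot be disabled at runtime and can be caught at the top level to print the message without a stack trace.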

Copied from original issue: cseed/hail#42

TestNGSuite "assert" is not serializable

From @tpoterba on September 23, 2015 15:26

It would be nice to be able to map assert over RDDs. This is impossible right now because assertions are not serializable.

Copied from original issue: cseed/hail#52

(vague) generate synthetic data for method verification

From @cseed on September 1, 2015 15:16

We should be on the lookout for an opportunity to write methods to generate synthetic data for method verification. Mixed logistic regression might be a good example.

Copied from original issue: cseed/hail#41

import VCF header

From @cseed on August 26, 2015 14:43

Store entire header. Parse contig at least. Expand sampleIds: Array[String] to store metadata.
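A rough Python sketch of what this could look like, assuming standard VCF header conventions (`##` meta lines, `##contig=<ID=...,length=...>` entries, and sample IDs after the FORMAT column of the `#CHROM` line). `parse_header` is a hypothetical name, and the attribute split is deliberately naive: it assumes no commas inside quoted values.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def parse_header(lines):
    """Return (all meta lines, contigs keyed by ID, sample IDs)."""
    meta = []
    contigs = {}
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)  # keep the entire header verbatim
            match = CONTIG_RE.match(line)
            if match:
                # Naive split: assumes no commas inside quoted values.
                attrs = dict(
                    pair.split("=", 1) for pair in match.group(1).split(",")
                )
                contigs[attrs["ID"]] = attrs
        elif line.startswith("#CHROM"):
            # Eight fixed columns plus FORMAT precede the sample IDs.
            samples = line.split("\t")[9:]
            break
    return meta, contigs, samples


# Hypothetical three-line header for illustration.
example = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=20,length=63025520>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n",
]
meta, contigs, samples = parse_header(example)
```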

Copied from original issue: cseed/hail#16

decide on documentation format

From @cseed on August 28, 2015 15:57

What format is the documentation going to be? We may need more than one system.

  • Reference on what methods are computing. This needs to include mathematical equations.
  • Tutorial on how to use k3.
  • Documentation for k3 developers (how to set up k3, git workflow, etc.) Markdown should be fine for this.
  • Use Scaladoc to document APIs used by developers against our codebase.

Copied from original issue: cseed/hail#31

Linear regression

From @jbloom22 on October 14, 2015 1:23

Cotton, I haven't implemented good testing yet, but would appreciate some initial feedback on the code as I take a break to write up the math. I've done some comparison with R and it checks out so far. It runs on the command line, but you'll need to create a covariate file from your PCs.

Copied from original issue: cseed/hail/pull/81

compute HWE

From @cseed on August 26, 2015 14:27

Copied from original issue: cseed/hail#11

documentation

From @cseed on August 28, 2015 15:57

We need to document k3 for users as a mathematician would, that is to say, precisely. It should include precise mathematical definitions for everything we can compute that can be cut-and-pasted into plots of our output.

Depends on https://github.com/cseed/k3/issues/31.

Copied from original issue: cseed/hail#32

add sex check

From @cseed on August 28, 2015 15:51

If you're not Alex and you're working on this, talk to Alex. He has some ideas about how this should be done.

Copied from original issue: cseed/hail#30

create tool for finding de novo variants in vcf containing trios

From @jbloom22 on September 1, 2015 18:20

Kaitlin has written a python tool that can serve as a model. From Kaitlin:

The most recent version is attached (3.93). The biggest issue that I've yet to resolve is how to handle multi-allelic lines above tri-allelic. Gets into nightmare territory quickly.

To run this, you'll need three things:

  1. VCF of interest
  2. PED file for the families in the VCF
  3. The ESP variant counts file that I made (.gz for the moment since it is so large)
    You can find this file here: /humgen/atgu1/fs03/wip/kaitlin/all_ESP_counts_5.28.13.txt

The command line argument should look like this:
python de_novo_finder_3.py all_ESP_counts_5.28.13.txt

I suggest specifying an output file. There are a few optional flags that you can use to adjust things in the script.

-v, --annotatevariants_VEP: If you have VEP annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type

-t, --thresh: The PL threshold set for the next most likely genotype in the child. This gets at how confident you want the het call to be in the child. Default is a PL threshold of 20, but you can adjust that up or down if you'd like.

-c, --minchildAB: I require that the heterozygous child has at least 20% alternative reads. You can adjust that with this.

-d, --depthratio: I require that the depth of coverage in the child is at least 1/10th that of the combined parental depth. You can adjust that with this (integer input for the 1/x).

-m, --prob(dn)metric: The minimum p(DN) that you will accept. I have it set at 0.05 and you could adjust it up. Due to the validation likelihoods that were added, you won't get anything below 0.05.

-p, --maxparentAB: I require that the parents, who should both be homozygous reference, have no more than 5% alternative reads. You can adjust that with this flag.

-a, --annotatevariants: If you have SnpEff annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type
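To make the read-based defaults above concrete, here is a small Python sketch of the allele-balance and depth checks only. It is not Kaitlin's script: `Call`, `passes_read_filters`, and the field names are hypothetical, the read counts are invented, and the PL threshold and p(DN) steps are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Read evidence for one individual at one site (hypothetical layout)."""
    ref_reads: int
    alt_reads: int
    depth: int

    @property
    def alt_fraction(self) -> float:
        total = self.ref_reads + self.alt_reads
        return self.alt_reads / total if total else 0.0


def passes_read_filters(child, mother, father,
                        min_child_ab=0.20,   # --minchildAB default
                        max_parent_ab=0.05,  # --maxparentAB default
                        depth_ratio=10):     # --depthratio default (1/x)
    """Apply the child AB, parent AB, and depth-ratio checks described above."""
    if child.alt_fraction < min_child_ab:
        return False
    # Child depth must be at least 1/depth_ratio of combined parental depth.
    if child.depth * depth_ratio < mother.depth + father.depth:
        return False
    return all(p.alt_fraction <= max_parent_ab for p in (mother, father))


# Invented counts: a het child with 30% alt reads and clean parents.
child = Call(ref_reads=14, alt_reads=6, depth=20)
mother = Call(ref_reads=30, alt_reads=0, depth=30)
father = Call(ref_reads=29, alt_reads=1, depth=30)
candidate = passes_read_filters(child, mother, father)
```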

Copied from original issue: cseed/hail#44

Tp bgen reader

From @tpoterba on October 1, 2015 18:27

Read my code and weep at its bloat

Copied from original issue: cseed/hail/pull/70

upgrade to jvm 1.8 and scala 2.11

From @cseed on August 26, 2015 14:32

Need to test with gradle and IntelliJ. Last time I tried this there was a dependency conflict. I don't remember the details.

Copied from original issue: cseed/hail#14

compute phasing from trios

From @cseed on August 31, 2015 22:23

This will involve deciding how to represent phasing information.

Copied from original issue: cseed/hail#38
