hail-is / hail
Cloud-native genomic dataframes and batch computing
Home Page: https://hail.is
License: MIT License
From @alexb-3 on September 30, 2015 20:42
Note that we are using BibTeX with the common bibfile.bib, and that we need a make tool that runs PDFTeX and BibTeX until no errors remain.
Copied from original issue: cseed/hail#68
From @cseed on August 26, 2015 14:20
Should be done concurrently with:
https://github.com/cseed/k3/issues/5
Copied from original issue: cseed/hail#9
From @cseed on September 29, 2015 15:43
Added gqbydp --plot option. Some issues to resolve:
Copied from original issue: cseed/hail#62
From @cseed on August 26, 2015 14:46
Depends on:
https://github.com/cseed/k3/issues/4
Copied from original issue: cseed/hail#17
From @cseed on September 24, 2015 17:16
Copied from original issue: cseed/hail#60
From @jbloom22 on August 26, 2015 22:19
Also, rebased.
Copied from original issue: cseed/hail/pull/25
From @cseed on September 3, 2015 13:47
Copied from original issue: cseed/hail#46
From @cseed on August 26, 2015 14:29
Requested by Andrea, "mean DP, standard deviation DP, mean GQ, standard deviation GQ".
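A minimal sketch of how the requested mean and standard deviation could be computed in one pass, assuming a mergeable accumulator so partial results from different partitions can be combined. `RunningStats` and its members are hypothetical names, not part of the codebase, and the choice of population (rather than sample) standard deviation is an assumption the request does not settle.

```scala
// Hypothetical accumulator for per-sample or per-variant DP/GQ statistics.
final case class RunningStats(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
  // Welford's update: fold one value into the running mean and the
  // running sum of squared deviations (m2).
  def add(x: Double): RunningStats = {
    val n1 = n + 1
    val delta = x - mean
    val mean1 = mean + delta / n1
    RunningStats(n1, mean1, m2 + delta * (x - mean1))
  }

  // Combine two partial results, e.g. from two partitions.
  def merge(that: RunningStats): RunningStats =
    if (n == 0) that
    else if (that.n == 0) this
    else {
      val total = n + that.n
      val delta = that.mean - mean
      RunningStats(
        total,
        mean + delta * that.n / total,
        m2 + that.m2 + delta * delta * n * that.n / total)
    }

  // Population standard deviation (assumption); NaN when no values were seen.
  def stdDev: Double = if (n == 0) Double.NaN else math.sqrt(m2 / n)
}

object RunningStats {
  // Convenience constructor: accumulate a whole collection.
  def of(xs: Iterable[Double]): RunningStats = xs.foldLeft(RunningStats())(_ add _)
}
```

The same accumulator would be used once for DP and once for GQ; `add` and `merge` have the shape of the sequence and combine operations that a Spark `aggregate` call expects.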
Copied from original issue: cseed/hail#12
From @cseed on August 26, 2015 16:4
The goal is to let analysts use our methods and data representations coded in Spark within python and R.
Tachyon might be part of the story:
Copied from original issue: cseed/hail#20
From @jbloom22 on September 29, 2015 17:15
In computing statistics like sample and variant QC, we should not treat X/Y like the autosomes. A simple solution is to compute only on the autosomes, but we should discuss with the community how to make the tools more powerful via sex awareness. For example, we could split all stats by sex, or just sex chromosome stats by sex, ...
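The "autosomes only" option could look roughly like the sketch below. It assumes human contig names of the form `1` to `22`, optionally prefixed with `chr`; `VariantSite` and `SexAwareQc` are hypothetical stand-ins, not the project's own types.

```scala
// Hypothetical minimal variant record, for illustration only.
final case class VariantSite(contig: String, position: Int)

object SexAwareQc {
  // True for contigs "1".."22", with or without a "chr" prefix (assumed naming).
  def isAutosome(contig: String): Boolean = {
    val name = contig.stripPrefix("chr")
    val allDigits =
      name.nonEmpty && name.length <= 2 && name.forall(c => c >= '0' && c <= '9')
    allDigits && { val k = name.toInt; k >= 1 && k <= 22 }
  }

  // Split sites into (autosomal, everything else) so QC statistics can be
  // computed on the first group and the second handled separately.
  def partitionByAutosome(sites: Seq[VariantSite]): (Seq[VariantSite], Seq[VariantSite]) =
    sites.partition(s => isAutosome(s.contig))
}
```

Splitting statistics by sex would need sample sex as an extra input, which this sketch does not model.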
Copied from original issue: cseed/hail#64
From @cseed on September 3, 2015 13:49
The other option for this, https://github.com/cseed/k3/issues/3 and https://github.com/cseed/k3/issues/46, is to use Hadoop-SAM and htsjdk.
Copied from original issue: cseed/hail#47
From @jbloom22 on September 30, 2015 20:21
Currently in MendelErrorsSuite we only test the internal representation, and not the correctness of the output files. The same issue arises whenever we support write but not read. We should come up with a strategy for testing in such cases.
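One possible strategy, sketched under assumptions: have the test write to a temporary file and read the raw text back with a small test-only reader, then compare against expected rows. `OutputFileCheck`, `writeTsv`, and `readTsv` are hypothetical helpers; `writeTsv` only stands in for whatever write-only export is under test.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object OutputFileCheck {
  // Stand-in for a write-only export: a header line, then one
  // tab-separated line per row.
  def writeTsv(file: File, header: Seq[String], rows: Seq[Seq[String]]): Unit = {
    val out = new PrintWriter(file, "UTF-8")
    try {
      out.println(header.mkString("\t"))
      rows.foreach(r => out.println(r.mkString("\t")))
    } finally out.close()
  }

  // Test-only reader: returns every line split into raw string fields,
  // so the test checks the bytes on disk rather than the in-memory result.
  def readTsv(file: File): List[List[String]] = {
    val src = Source.fromFile(file, "UTF-8")
    try src.getLines().map(_.split("\t", -1).toList).toList
    finally src.close()
  }
}
```

A suite could then create a file with `File.createTempFile`, run the export, and assert that `readTsv` returns the expected header and rows.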
Copied from original issue: cseed/hail#67
From @cseed on August 26, 2015 14:51
Waiting on suitable machines (Intel spark cluster, cloud access, etc.)
gzip -cd file.vcf.gz | wc -l
vs LoadVCF)
Copied from original issue: cseed/hail#18
From @cseed on August 26, 2015 21:36
to be deleted soon).
Copied from original issue: cseed/hail/pull/24
From @cseed on September 1, 2015 15:48
Need a general strategy for error reporting. We can't use asserts for input data validation (e.g., reading TSV files, command line flags, etc.). We need to generate good error messages with feedback about line numbers, etc.
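As a rough sketch of what such reporting might look like, the parser below returns either the parsed rows or an error value that carries the file name and 1-based line number. `ParseError`, `TsvValidation`, and the two-column `sampleId<TAB>depth` layout are all hypothetical, chosen only to illustrate the idea.

```scala
// Hypothetical error value that records where the bad input was found.
final case class ParseError(file: String, line: Int, message: String) {
  override def toString: String = s"$file:$line: $message"
}

object TsvValidation {
  // Parse lines of the form "sampleId<TAB>depth". Stops at the first bad
  // line and reports it with its 1-based line number instead of asserting.
  def parseDepths(file: String, lines: Iterator[String]): Either[ParseError, Vector[(String, Int)]] = {
    val out = Vector.newBuilder[(String, Int)]
    var lineNo = 0
    while (lines.hasNext) {
      lineNo += 1
      val fields = lines.next().split("\t", -1)
      if (fields.length != 2)
        return Left(ParseError(file, lineNo,
          s"expected 2 tab-separated fields, found ${fields.length}"))
      try out += ((fields(0), fields(1).toInt))
      catch {
        case _: NumberFormatException =>
          return Left(ParseError(file, lineNo, s"depth is not an integer: '${fields(1)}'"))
      }
    }
    Right(out.result())
  }
}
```

Returning an `Either` keeps validation failures as ordinary values, so a command-line front end can print the message and exit cleanly rather than surfacing an assertion stack trace.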
Copied from original issue: cseed/hail#42
From @tpoterba on September 23, 2015 15:26
It would be nice to be able to map assert in RDDs. This is impossible right now because assertions are not serializable.
Copied from original issue: cseed/hail#52
From @cseed on September 1, 2015 15:16
We should be on the lookout for an opportunity to write methods to generate synthetic data for method verification. Mixed logistic regression might be a good example.
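A minimal sketch of a synthetic data generator, simplified to plain (not mixed) logistic regression with a single covariate: the random effects a mixed model would need are left out. `SyntheticLogistic` and its parameters are hypothetical. Because the true intercept and slope are known, a fitting method can be checked against them.

```scala
import scala.util.Random

object SyntheticLogistic {
  final case class Observation(x: Double, y: Int)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Draw n observations with x ~ N(0, 1) and
  // P(y = 1 | x) = sigmoid(intercept + slope * x).
  // A fixed seed makes the data reproducible across test runs.
  def simulate(n: Int, intercept: Double, slope: Double, seed: Long): Vector[Observation] = {
    val rng = new Random(seed)
    Vector.fill(n) {
      val x = rng.nextGaussian()
      val y = if (rng.nextDouble() < sigmoid(intercept + slope * x)) 1 else 0
      Observation(x, y)
    }
  }
}
```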
Copied from original issue: cseed/hail#41
From @cseed on August 26, 2015 14:43
Store entire header. Parse contig at least. Expand sampleIds: Array[String] to store metadata.
Copied from original issue: cseed/hail#16
From @cseed on August 31, 2015 22:34
Copied from original issue: cseed/hail#39
From @cseed on August 28, 2015 15:57
What format is the documentation going to be? We may need more than one system.
Copied from original issue: cseed/hail#31
From @jbloom22 on October 14, 2015 1:23
Cotton, I haven't implemented good testing yet, but would appreciate some initial feedback on the code as I take a break to write up the math. I've done some comparison with R and it checks out so far. It runs on the command line, but you'll need to create a covariate file from your PCs.
Copied from original issue: cseed/hail/pull/81
From @cseed on August 26, 2015 14:37
When designing, consider the plinkseq --mask option:
https://atgu.mgh.harvard.edu/plinkseq/masks.shtml
and bcftools filter expressions:
https://samtools.github.io/bcftools/bcftools.html#expressions
Copied from original issue: cseed/hail#15
From @cseed on August 26, 2015 14:27
Copied from original issue: cseed/hail#11
From @cseed on August 28, 2015 15:57
We need to document k3 for users as a mathematician would, that is to say, precisely. It should include precise mathematical definitions for everything we can compute that can be cut-and-pasted into plots of our output.
Depends on https://github.com/cseed/k3/issues/31.
Copied from original issue: cseed/hail#32
From @jbloom22 on September 23, 2015 21:35
See the VCF methods here:
http://broadinstitute.github.io/picard/index.html
Copied from original issue: cseed/hail#54
From @cseed on August 28, 2015 15:51
If you're not Alex and you're working on this, talk to Alex. He has some ideas about how this should be done.
Copied from original issue: cseed/hail#30
From @cseed on August 26, 2015 14:56
including for Spark jobs. Consider YourKit:
Copied from original issue: cseed/hail#19
From @alexb-3 on August 26, 2015 16:10
Copied from original issue: cseed/hail#21
From @cseed on August 26, 2015 14:23
Requires loading .ped/.fam files.
Copied from original issue: cseed/hail#10
From @cseed on September 22, 2015 16:16
See:
http://www.well.ox.ac.uk/~gav/bgen_format/
http://www.well.ox.ac.uk/~gav/bgen_format/bgen_format_v1.2.html
Copied from original issue: cseed/hail#50
From @cseed on August 31, 2015 22:22
Compare plink:
http://pngu.mgh.harvard.edu/~purcell/plink/fanal.shtml#tdt
Copied from original issue: cseed/hail#37
From @cseed on September 22, 2015 20:35
Copied from original issue: cseed/hail#51
From @cseed on September 24, 2015 16:26
Copied from original issue: cseed/hail#56
From @jbloom22 on September 1, 2015 18:20
Kaitlin has written a python tool that can serve as a model. From Kaitlin:
The most recent version is attached (3.93). The biggest issue that I've yet to resolve is how to handle multi-allelic lines above tri-allelic. Gets into nightmare territory quickly.
To run this, you'll need three things:
The command line argument should look like this:
python de_novo_finder_3.py all_ESP_counts_5.28.13.txt
I suggest specifying an output file. There are a few optional flags that you can use to adjust things in the script.
-v, --annotatevariants_VEP: If you have VEP annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type
-t, --thresh: The PL threshold set for the next most likely genotype in the child. This gets after how confident you want the het call to be in the child. Default is a PL threshold of 20, but you can adjust that up or down if you'd like.
-c, --minchildAB: I require that the heterozygous child has at least 20% alternative reads. You can adjust that with this.
-d, --depthratio: I require that the depth of coverage in the child is at least 1/10th that of the combined parental depth. You can adjust that with this (integer input for the 1/x).
-m, --prob(dn)metric: The minimum p(DN) that you will accept. I have it set at 0.05 and you could adjust it up. Due to the validation likelihoods that were added, you won't get anything below 0.05.
-p, --maxparentAB: I require that the parents, who should both be homozygous reference, have no more than 5% alternative reads. You can adjust that with this flag.
-a, --annotatevariants: If you have SnpEff annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type
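The default thresholds described in the flags above could be combined into a single filter roughly as follows. This is a sketch, not the logic of the Python tool: the type and field names are hypothetical, and treating every threshold as inclusive is an assumption.

```scala
object DeNovoFilters {
  // Defaults taken from the flag descriptions above.
  final case class Thresholds(
    minChildPl: Int = 20,          // -t: PL of the child's next most likely genotype
    minChildAb: Double = 0.20,     // -c: minimum alt read fraction in the child
    depthRatio: Int = 10,          // -d: child depth must be >= 1/x of parental depth
    minProbDeNovo: Double = 0.05,  // -m: minimum accepted p(DN)
    maxParentAb: Double = 0.05)    // -p: maximum alt read fraction in each parent

  // Read counts for one individual at one site.
  final case class Call(refReads: Int, altReads: Int) {
    def depth: Int = refReads + altReads
    def altFraction: Double = if (depth == 0) 0.0 else altReads.toDouble / depth
  }

  // True when a candidate passes every threshold.
  def passes(child: Call, childNextBestPl: Int, father: Call, mother: Call,
             probDeNovo: Double, t: Thresholds = Thresholds()): Boolean = {
    val parentalDepth = father.depth + mother.depth
    childNextBestPl >= t.minChildPl &&
      child.altFraction >= t.minChildAb &&
      child.depth.toLong * t.depthRatio >= parentalDepth &&
      probDeNovo >= t.minProbDeNovo &&
      father.altFraction <= t.maxParentAb &&
      mother.altFraction <= t.maxParentAb
  }
}
```

Keeping the thresholds in one case class with defaults mirrors the optional command-line flags: callers override only the values they want to change.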
Copied from original issue: cseed/hail#44
From @tpoterba on October 1, 2015 18:27
Read my code and weep at its bloat
Copied from original issue: cseed/hail/pull/70
From @cseed on August 26, 2015 14:32
Need to test with gradle and IntelliJ. Last time I tried this there was a dependency conflict. I don't remember the details.
Copied from original issue: cseed/hail#14
From @tpoterba on October 2, 2015 19:21
There's something in MapReduceMethod that I think needs to change:
aggregateBySampleWithKeys(zeroValue)((e, v, s, g) => seqOp(e, g), combOp)
that's in VariantSampleMatrix
but in MapReduceMethod, we have:
override def seqOpWithKeys(v: Variant, s: Int, g: Genotype, acc: T): T =
fold(mapWithKeys(v, s, g), acc)
Copied from original issue: cseed/hail#74
From @cseed on August 26, 2015 22:44
Copied from original issue: cseed/hail#26
From @cseed on August 31, 2015 22:23
This will involve deciding how to represent phasing information.
Copied from original issue: cseed/hail#38
From @jbloom22 on September 29, 2015 17:21
Once we handle multi-allelic sites, we will need to adapt mendel errors so that, for example, it does not double count errors in multi-allelic trios.
Copied from original issue: cseed/hail#65