genomicsengland / gel-coverage Goto Github PK
View Code? Open in Web Editor NEWLicense: Apache License 2.0
License: Apache License 2.0
When there is no gene having a transcript that passes the filters of flag
and biotype
, the coverage analysis fails with an error Incorrect BED file!
.
To reproduce run:
BW=/genomes/analysis/by_date/2015-01-12/RAREP01846/LP2000717-DNA_A01/coverage/LP2000717-DNA_A01.bw
PANEL=5763f2ea8f620350a1996048
PANEL_VERSION=1.0
So far we have only tested panels from PanelApp and custom gene lists by preparing the files with the very specific regions required for the coverage analysis.
Run the whole exome mode on a whole genome bigwig file and record execution times.
Also run the other modes with whole genome files.
We had a meeting a few weeks ago to discuss requirements for reporting coverage. Details here:
https://cnfl.extge.co.uk/display/BIO/Panel+Coverage+Meeting
And from the V&F meeting, coverage across the union of coding genes (or those used in Tiering), with padding of 15bp (or customisable) up and downstream of each exon is needed.
Genes in PanelApp are classified as HighConfidence, MediumConfidence and LowConfidence. Usually we only need those having HighConfidence.
Subtasks:
Acceptance criteria:
Non goals:
When running:
sh ./run_coverage_analysis.sh /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw_558c24a2bb5a166f63868678_1.0.json
We get :
558c24a2bb5a166f63868678 1.0 DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): bioinfo.hpc.cam.ac.uk DEBUG:requests.packages.urllib3.connectionpool:http://bioinfo.hpc.cam.ac.uk:80 "HEAD /cellbase HTTP/1.1" 302 0 DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 10.5.8.201 DEBUG:requests.packages.urllib3.connectionpool:http://10.5.8.201:8080 "HEAD /cellbase-4.5.0-rc HTTP/1.1" 302 0 DEBUG:root:Getting gene list from PanelApp... DEBUG:root:Gene list obtained from PanelApp of 0! INFO:root:Gene list to analyse: INFO:root:0 genes to analyse DEBUG:root:Creating the BigWig reader... DEBUG:root:BigWig reader created! INFO:root:Sanity checks on the configuration OK! INFO:root:Starting coverage analysis INFO:root:Building gene annotations bed file from CellBase... Traceback (most recent call last): File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 103, in <module> main() File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 88, in main (results, bed) = gel_coverage_engine.run() File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/runner.py", line 419, in run bed = self.cellbase_helper.make_exons_bed(self.gene_list, has_chr_prefix=self.bigwig_reader.has_chr_prefix) File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/tools/cellbase_helper.py", line 169, in make_exons_bed raise SystemError("Input gene list is not correct") SystemError: Input gene list is not correct
On the 8th of February 2017
The configuration file will read two type of parameters:
Subtasks:
Acceptance criteria:
Non goals:
When aggregating across chromosomes by calculating the median we need to weight by the size of the chromosome. Otherwise the results will be biased towards smaller chromosomes.
In order to improve performance and improve robustness towards possible CellBase server failures we want to cache the results obtained for each gene.
Subtasks:
Acceptance criteria:
coverage.sh should take an argument to indicate whether the genome is aligned to GRCh37 or GRCh38. Steps run by the script should then take this genome version into account.
It appears that coverage_summary.py takes an --assembly
argument (see here)
The uneveness of coverage is important for the quality control of cancer samples. This is a whole genome metric that is being obtained now by an independent script:
/genomes/software/apps/bwtool/bwtool summary 100000 $BW $BWTOOLS -header -with-sum-of-squares -with-quantiles
R --slave --args $BWTOOL < /genomes/scratch/asosinsky/dev/RMSDcov.R > ./tmp 2>&1
RMSD_COV=cat ./tmp| cut -f 1 -d ' '
At initialization there is a log listing all the genes to be analysed. This log is only intended for panel or gene list analysis, but it is written also with other configurations. In GRCh37 after extracting from CellBase all the genes to be analysed, these are printed without problem, but this is not the case for GRCh38. Too long list to join items with comma I guess... so this causes a segmentation error in Python interpreter.
When obtaining the transcripts to analyse from a list of genes we want to filter the transcripts for each gene by two criteria: basic flag (GeneCode genes) and biotype (@Antonior26 you need to define this). This filtering should be configurable through a configuration file.
When the input is a list of transcripts the filtering will not take place.
Dependencies: #3
Acceptance criteria:
Think is clear :)
Chromosome identifiers might use a "chr" prefix or not. In the coverage pipeline we are using different conventions across the different modules.
BAMs are usually using the chromosomes without prefix (though some exceptions have been observed /genomes/by_date/2016-11-18/RAREP40001/LP2000274-DNA_B11/).
The module that converts a BAM into a coverage bigwig file is expecting a BAM with chromosomes not using the prefix, while the bigwig output contains chromosomes with prefix. The conversion from chromosomes without prefix to chromosomes with prefix is too simplistic and some contigs in the reference might be wrongly transformed (https://github.com/genomicsengland/gel-coverage/blob/master/bam2wig/get_chr_sizes.py).
The module that analyses the bigwig relies on the reference extracted from CellBase which uses strictly chromosomes without the prefix, but it has to deal with a bigwig using the prefix. The output statistics chromosomes don't have the prefix.
The region on which to compute a coverage analysis is obtained from a list of genes. This list of genes might come from a panel from PanelApp, a list of genes or no list, meaning all of the available genes. Add support for the three types of input. The priority of parameters is as follows: panel, gene list and whole exome. Meaning for instance that if a panel and a gene list is provided, the panel will overrride the gene list.
Subtasks:
Acceptance criteria:
Non goals:
The region to analyse might be defined by a set of transcripts as opposed to a set of genes. Even if conceptually is straightforward with the existing implementation based on gene list this it might not (further analysis required).
Subtasks:
Acceptance criteria:
Replace the hard-coded relative path to the config file by a new command line parameter. This should be a required parameter.
Given that the input files are bigwig this files are quite big even when subsetting very specific regions in the genome (over 100 MB). As we want to avoid uploading these "big" files to GitHub we would like to make those available through OpenCGA.
The implemented tests should take care of finding this files, downloading them and setting them in place to run the test.
Subtasks:
Acceptance:
Non-goals:
Some regions in the genome may not be included in the bigwig. We need to detect those cases and report them in the JSON as unsequenced regions.
A coding region not included in the bigwig that was detected during testing (Test 3) which corresponds to /genomes/analysis/by_date/2016-09-27/HX01166477/CancerLP3000079-DNA_F03_NormalLP3000067-DNA_C12/coverage/LP3000079-DNA_F03.bw
is chrGL000191.1:50009-50281
See: http://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000215699;r=GL000191.1:1-106433;t=ENST00000374457
Verified that the corresponding BAM does not include this "chromosome". Only includes 1-22, X, Y and MT.
We need to get aggregated metrics by chromosome on the whole genome and on the coding region (or otherwise panel or gene list if provided). This should be stored not as an additional level in the hierarchy, but as a separate entry in the results section.
Also, aggregation across all chromosomes and across autosomes is desired.
When calculating the whole genome coverage metrics there are some regions in the genome that only contain unknown bases (Ns) in the reference genome and that add a bias towards underestimation.
There are two alternatives:
Add standard deviation to the set of statistics computed on coverage data.
We need to compute coverage statistics on exons including 15bp up and downstream.
Subtasks:
Acceptance criteria:
Non goals:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.