genomicsengland / gel-coverage

License: Apache License 2.0

Python 86.81% R 6.05% Shell 6.08% Perl 0.78% Dockerfile 0.28%

gel-coverage's Issues

Add support for chromosome chr prefix on cellbase helper

Error when all transcripts are filtered out

When no gene has a transcript that passes the flag and biotype filters, the coverage analysis fails with the error "Incorrect BED file!".

To reproduce run:

BW=/genomes/analysis/by_date/2015-01-12/RAREP01846/LP2000717-DNA_A01/coverage/LP2000717-DNA_A01.bw 
PANEL=5763f2ea8f620350a1996048 
PANEL_VERSION=1.0

Stress test: run the analysis on a whole genome file

So far we have only tested panels from PanelApp and custom gene lists by preparing the files with the very specific regions required for the coverage analysis.
Run the whole exome mode on a whole genome bigwig file and record execution times.
Also run the other modes with whole genome files.

Requirements meeting

We had a meeting a few weeks ago to discuss requirements for reporting coverage. Details here:

https://cnfl.extge.co.uk/display/BIO/Panel+Coverage+Meeting

And from the V&F meeting, coverage across the union of coding genes (or those used in Tiering), with padding of 15bp (or customisable) up and downstream of each exon is needed.

Parametrize the exclusion of genes from the input panel/s based on confidence

Genes in PanelApp are classified as HighConfidence, MediumConfidence and LowConfidence. Usually we only need those having HighConfidence.

Subtasks:

Add this parameter to the CLI
Implement filtering when querying PanelApp

Acceptance criteria:

Genes are properly filtered

Non goals:

Filter genes by any parameter not present in PanelApp
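
A minimal sketch of the filtering step, assuming each gene returned by PanelApp has already been parsed into a dict with hypothetical "name" and "confidence" keys. The function name and the data shape are illustrative only and are not taken from the PanelApp client.

```python
CONFIDENCE_LEVELS = ("HighConfidence", "MediumConfidence", "LowConfidence")


def filter_genes_by_confidence(genes, accepted=("HighConfidence",)):
    """Return the names of the genes whose confidence label is accepted.

    `genes` is a list of dicts with hypothetical keys "name" and
    "confidence"; the real PanelApp payload may be shaped differently.
    """
    unknown = set(accepted) - set(CONFIDENCE_LEVELS)
    if unknown:
        raise ValueError("Unknown confidence level(s): %s" % ", ".join(sorted(unknown)))
    return [gene["name"] for gene in genes if gene["confidence"] in accepted]


# Made-up panel content, for illustration only.
panel = [
    {"name": "BRCA1", "confidence": "HighConfidence"},
    {"name": "TTN", "confidence": "LowConfidence"},
    {"name": "CFTR", "confidence": "MediumConfidence"},
]
high_only = filter_genes_by_confidence(panel)
```

Defaulting to HighConfidence mirrors the "usually we only need those" statement above, while the `accepted` argument is what a CLI parameter would feed.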

SystemError: Input gene list is not correct

When running:

sh ./run_coverage_analysis.sh /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw_558c24a2bb5a166f63868678_1.0.json

We get :

558c24a2bb5a166f63868678 1.0
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): bioinfo.hpc.cam.ac.uk
DEBUG:requests.packages.urllib3.connectionpool:http://bioinfo.hpc.cam.ac.uk:80 "HEAD /cellbase HTTP/1.1" 302 0
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 10.5.8.201
DEBUG:requests.packages.urllib3.connectionpool:http://10.5.8.201:8080 "HEAD /cellbase-4.5.0-rc HTTP/1.1" 302 0
DEBUG:root:Getting gene list from PanelApp...
DEBUG:root:Gene list obtained from PanelApp of 0!
INFO:root:Gene list to analyse:
INFO:root:0 genes to analyse
DEBUG:root:Creating the BigWig reader...
DEBUG:root:BigWig reader created!
INFO:root:Sanity checks on the configuration OK!
INFO:root:Starting coverage analysis
INFO:root:Building gene annotations bed file from CellBase...
Traceback (most recent call last):
  File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 103, in <module>
    main()
  File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 88, in main
    (results, bed) = gel_coverage_engine.run()
  File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/runner.py", line 419, in run
    bed = self.cellbase_helper.make_exons_bed(self.gene_list, has_chr_prefix=self.bigwig_reader.has_chr_prefix)
  File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/tools/cellbase_helper.py", line 169, in make_exons_bed
    raise SystemError("Input gene list is not correct")
SystemError: Input gene list is not correct

On the 8th of February 2017

Support configuration file and create one example configuration file

The configuration file will hold two types of parameters:

Usually static parameters that we don't want to leave at CLI level (e.g.: assembly)
CellBase and PanelApp parameters that will be eventually converted into the JSON format supported by clients.

Subtasks:

Define the parameters in the configuration file
Implement read and usage of configuration

Acceptance criteria:

Parameters in configuration file are agreed
Application uses the configuration file
There is an example config file in the repository

Non goals:

Replace all parameters in CLI
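
As a sketch of what such a file and its reading could look like, assuming an INI-style layout read with Python's standard configparser module: the section names, keys and host names below are placeholders, not the agreed parameters.

```python
import configparser

# Hypothetical example configuration; every section, key and value here is
# a placeholder for illustration.
EXAMPLE_CONFIG = """
[analysis]
assembly = GRCh37

[cellbase]
host = cellbase.example.org
version = v4

[panelapp]
host = panelapp.example.org
"""


def load_config(text):
    """Parse INI-style text into a plain {section: {key: value}} dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_config(EXAMPLE_CONFIG)
```

Keeping the CellBase and PanelApp settings in their own sections makes it straightforward to convert each section later into the JSON structure expected by the clients.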

Unevenness of coverage metric is not being weighted and the result might be biased

When aggregating across chromosomes by calculating the median we need to weight by the size of the chromosome. Otherwise the results will be biased towards smaller chromosomes.
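
A minimal sketch of a size-weighted median, assuming one metric value per chromosome and the chromosome length as its weight. The function name and the example figures are illustrative (lengths are approximate, in Mb).

```python
def weighted_median(values, weights):
    """Return the smallest value at which the cumulative weight reaches
    half of the total weight."""
    pairs = sorted(zip(values, weights))
    total = float(sum(weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2.0:
            return value
    raise ValueError("no values given")


# Made-up per-chromosome metric values for chr1, chr21 and chrY, weighted
# by approximate chromosome length in Mb.
metric = [0.2, 0.9, 0.8]
length_mb = [249, 48, 59]
result = weighted_median(metric, length_mb)
```

With these figures the unweighted median would be 0.8, driven by the two small chromosomes, whereas the weighted median is 0.2 because chr1 accounts for more than half of the total length.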

Cache reference genome BED obtained from CellBase

To improve performance and robustness against possible CellBase server failures, we want to cache the results obtained for each gene.

Subtasks:

Identify the caching unit (could be organism-assembly-gene)
Implement cache storage
Implement cache reading

Acceptance criteria:

Caching works at gene level. For instance, if we cache a panel and then run a whole exome, genes in the panel will be read from the cache, while the rest will be queried from CellBase
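
The acceptance criterion can be illustrated with a small in-memory sketch keyed by organism, assembly and gene. The class, the fetch callable and the returned data shape are all hypothetical, and persistent storage is deliberately left out.

```python
class GeneCache(object):
    """In-memory cache keyed by (organism, assembly, gene)."""

    def __init__(self, fetch):
        # `fetch` is any callable(organism, assembly, gene) that queries
        # the remote service; here it stands in for a CellBase query.
        self._fetch = fetch
        self._store = {}

    def get(self, organism, assembly, gene):
        key = (organism, assembly, gene)
        if key not in self._store:
            self._store[key] = self._fetch(organism, assembly, gene)
        return self._store[key]


calls = []


def fake_cellbase_fetch(organism, assembly, gene):
    # Stand-in for the real query; records which genes were requested.
    calls.append(gene)
    return {"gene": gene, "exons": []}


cache = GeneCache(fake_cellbase_fetch)
for gene in ["BRCA1", "BRCA2"]:  # first run: a small panel
    cache.get("hsapiens", "GRCh37", gene)
for gene in ["BRCA1", "BRCA2", "TP53"]:  # later run: a wider gene set
    cache.get("hsapiens", "GRCh37", gene)
```

After both runs, `calls` contains each gene only once: the panel genes are served from the cache in the second run and only TP53 triggers a new query.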

Add GRCh38 support to wrapper script

coverage.sh should take an argument to indicate whether the genome is aligned to GRCh37 or GRCh38. Steps run by the script should then take this genome version into account.

It appears that coverage_summary.py takes an --assembly argument (see here)

Add whole genome metric of coverage unevenness

The unevenness of coverage is important for the quality control of cancer samples. This is a whole genome metric that is currently obtained by an independent script:

/genomes/software/apps/bwtool/bwtool summary 100000 $BW $BWTOOLS -header -with-sum-of-squares -with-quantiles
R --slave --args $BWTOOL < /genomes/scratch/asosinsky/dev/RMSDcov.R > ./tmp 2>&1
RMSD_COV=$(cat ./tmp | cut -f 1 -d ' ')

Segmentation fault for GRCh38 coverage analysis

At initialization, a log message lists all the genes to be analysed. This message is only intended for panel or gene list analyses, but it is also written with other configurations. In GRCh37, after extracting all the genes to be analysed from CellBase, they are printed without problem, but this is not the case for GRCh38. My guess is that the list is too long to join with commas, which causes a segmentation fault in the Python interpreter.
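
Whatever the root cause of the crash turns out to be, one way to keep this log message bounded is to list only the first few genes. The sketch below is an assumption about a possible mitigation, not the fix applied in the repository; the function name and the limit are hypothetical.

```python
def summarise_gene_list(genes, max_listed=50):
    """Build a log message that names at most `max_listed` genes."""
    shown = ", ".join(genes[:max_listed])
    hidden = len(genes) - max_listed
    if hidden > 0:
        return "%d genes to analyse: %s (and %d more)" % (len(genes), shown, hidden)
    return "%d genes to analyse: %s" % (len(genes), shown)


message = summarise_gene_list(["BRCA1", "BRCA2", "TP53"], max_listed=2)
```

The message can then be passed to the logger as usual, and its length no longer depends on the number of genes in the assembly.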

Filter transcripts to be used by basic flag (GENCODE genes) and also biotype

When obtaining the transcripts to analyse from a list of genes, we want to filter the transcripts for each gene by two criteria: basic flag (GENCODE genes) and biotype (@Antonior26 you need to define this). This filtering should be configurable through a configuration file.

When the input is a list of transcripts the filtering will not take place.

Dependencies: #3

Acceptance criteria:

Transcripts are properly filtered
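
A sketch of the two filters, assuming each transcript is a dict with hypothetical "flags" and "biotype" keys. The accepted biotypes are still to be defined, so "protein_coding" below is only a placeholder default.

```python
def filter_transcripts(transcripts, require_basic=True, biotypes=("protein_coding",)):
    """Keep transcripts carrying the basic flag and an accepted biotype.

    The "flags" and "biotype" keys are assumptions about the data shape;
    passing an empty `biotypes` disables the biotype filter.
    """
    kept = []
    for transcript in transcripts:
        if require_basic and "basic" not in transcript.get("flags", []):
            continue
        if biotypes and transcript.get("biotype") not in biotypes:
            continue
        kept.append(transcript)
    return kept


# Made-up transcripts, for illustration only.
transcripts = [
    {"id": "T1", "flags": ["basic"], "biotype": "protein_coding"},
    {"id": "T2", "flags": [], "biotype": "protein_coding"},
    {"id": "T3", "flags": ["basic"], "biotype": "lincRNA"},
]
kept_ids = [t["id"] for t in filter_transcripts(transcripts)]
```

Both arguments map naturally onto configuration file entries, and a caller can check for an empty result before building the BED file.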

Support for multiple panels

I think this is clear :)

Coverage pipeline usage of different chromosome identifiers conventions

Chromosome identifiers might or might not use a "chr" prefix. In the coverage pipeline we are using different conventions across the different modules.
BAMs usually use chromosomes without the prefix (though some exceptions have been observed, e.g. /genomes/by_date/2016-11-18/RAREP40001/LP2000274-DNA_B11/).
The module that converts a BAM into a coverage bigwig file expects a BAM with chromosomes not using the prefix, while the bigwig output contains chromosomes with the prefix. The conversion from chromosomes without prefix to chromosomes with prefix is too simplistic and some contigs in the reference might be wrongly transformed (https://github.com/genomicsengland/gel-coverage/blob/master/bam2wig/get_chr_sizes.py).
The module that analyses the bigwig relies on the reference extracted from CellBase, which strictly uses chromosomes without the prefix, but it has to deal with a bigwig using the prefix. Chromosomes in the output statistics don't have the prefix.
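
One way to make the convention explicit is a single helper that converts any identifier to the convention of the target file. This is a sketch with a hypothetical function name; it only adds or strips the prefix and does not attempt to translate contig names that differ between references beyond that.

```python
def normalise_chromosome(name, has_chr_prefix):
    """Return `name` with or without the "chr" prefix, as requested.

    Only the prefix is handled; any other naming difference between
    references is out of scope for this sketch.
    """
    bare = name[3:] if name.startswith("chr") else name
    return "chr" + bare if has_chr_prefix else bare


for_bigwig = normalise_chromosome("GL000191.1", has_chr_prefix=True)
for_output = normalise_chromosome("chrX", has_chr_prefix=False)
```

Because the helper is idempotent, it can be applied at every module boundary without first checking which convention the input already follows.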

Compute coverage statistics at gene level using the union transcript

Compute coverage statistics at panel level

Parametrize analysis region by: panels, gene list or whole exome

The region on which to compute a coverage analysis is obtained from a list of genes. This list of genes might come from a panel in PanelApp, from a user-provided list of genes, or from no list at all, meaning all of the available genes. Add support for the three types of input. The priority of parameters is as follows: panel, gene list and whole exome. This means, for instance, that if both a panel and a gene list are provided, the panel will override the gene list.

Subtasks:

Add appropriate parameters
Implement the retrieval of genes from PanelApp
Implement the retrieval of all available genes from CellBase

Acceptance criteria:

Users can use any of the alternatives to provide a list of genes

Non goals:

Allow the user to provide a list of transcripts
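
The priority rule described above can be sketched as a small resolver. The function name, argument names and return labels are hypothetical.

```python
def resolve_gene_source(panel=None, gene_list=None):
    """Apply the priority panel > gene list > whole exome."""
    if panel:
        return ("panel", panel)
    if gene_list:
        return ("gene_list", gene_list)
    return ("whole_exome", None)


# A panel overrides a gene list when both are given.
source = resolve_gene_source(panel="5763f2ea8f620350a1996048", gene_list=["BRCA1"])
```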

Parametrize analysis region by: transcript list

The region to analyse might be defined by a set of transcripts as opposed to a set of genes. Even though this is conceptually straightforward, it might not be with the existing implementation, which is based on a gene list (further analysis required).

Subtasks:

Analyse how current BED generation approach can be adapted to a list of transcripts. Do we need another implementation of the BED creation?
Retrieve all the information for a list of transcripts from CellBase. Transcript end-point does not have search implemented (this might need clarification from @dapregi).

Acceptance criteria:

Analysis works on a list of transcripts

Parametrize the configuration file

Replace the hard-coded relative path to the config file by a new command line parameter. This should be a required parameter.
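
A sketch of a required option using Python's standard argparse module. The option name `--config` and the example path are assumptions, not the names used by the tool.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Coverage analysis (sketch)")
    # Hypothetical option name; required, so there is no hard-coded default.
    parser.add_argument("--config", required=True,
                        help="Path to the configuration file")
    return parser


args = build_parser().parse_args(["--config", "/path/to/coverage.cfg"])
```

When the option is missing, argparse prints a usage message and exits, so a run can never silently fall back to a path relative to the working directory.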

Automate test data download by making it available in OpenCGA

Given that the input files are bigwig, these files are quite big even when subsetting very specific regions in the genome (over 100 MB). As we want to avoid uploading these "big" files to GitHub, we would like to make them available through OpenCGA.
The implemented tests should take care of finding these files, downloading them and setting them in place to run the test.

Format output in computer-friendly format

Subtasks:

Define output format
Define statistics to be computed

Acceptance:

Format and statistics are documented and agreed

Non-goals:

Implement it

Report not sequenced regions

Some regions in the genome may not be included in the bigwig. We need to detect those cases and report them in the JSON as unsequenced regions.

A coding region not included in the bigwig was detected during testing (Test 3): chrGL000191.1:50009-50281, for the file /genomes/analysis/by_date/2016-09-27/HX01166477/CancerLP3000079-DNA_F03_NormalLP3000067-DNA_C12/coverage/LP3000079-DNA_F03.bw
See: http://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000215699;r=GL000191.1:1-106433;t=ENST00000374457

Verified that the corresponding BAM does not include this "chromosome". Only includes 1-22, X, Y and MT.
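
Detection at contig level can be sketched as a comparison between the requested regions and the set of chromosome names present in the bigwig, for instance read from its header. The function name and the tuple layout are hypothetical, and regions that are only partially absent are not covered here.

```python
def split_by_availability(regions, bigwig_chromosomes):
    """Separate regions whose chromosome is present in the bigwig from
    those whose chromosome is missing altogether."""
    sequenced, unsequenced = [], []
    for chrom, start, end in regions:
        if chrom in bigwig_chromosomes:
            sequenced.append((chrom, start, end))
        else:
            unsequenced.append((chrom, start, end))
    return sequenced, unsequenced


# Chromosome names as they would appear in a bigwig holding only 1-22, X, Y and MT.
available = {"chr%s" % c for c in list(range(1, 23)) + ["X", "Y", "MT"]}
regions = [("chr1", 1000, 2000), ("chrGL000191.1", 50009, 50281)]
sequenced, unsequenced = split_by_availability(regions, available)
```

The `unsequenced` list is what would be written to the JSON output, here containing the chrGL000191.1 region from the example above.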

Add statistics at chromosome level for whole genome and exome regions

We need to get aggregated metrics by chromosome on the whole genome and on the coding region (or otherwise panel or gene list if provided). This should be stored not as an additional level in the hierarchy, but as a separate entry in the results section.
Also, aggregation across all chromosomes and across autosomes is desired.
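
A sketch of the aggregation, assuming that for each chromosome we already hold the number of bases and the sum of coverage over them. The function name, input shape and output keys are hypothetical, and chromosome names are assumed to carry no "chr" prefix.

```python
AUTOSOMES = {str(i) for i in range(1, 23)}


def aggregate_mean_coverage(per_chromosome):
    """per_chromosome maps chromosome -> (n_bases, coverage_sum)."""

    def pooled_mean(chromosomes):
        bases = sum(per_chromosome[c][0] for c in chromosomes)
        total = sum(per_chromosome[c][1] for c in chromosomes)
        return total / float(bases) if bases else None

    return {
        "chromosomes": {c: pooled_mean([c]) for c in per_chromosome},
        "all": pooled_mean(list(per_chromosome)),
        "autosomes": pooled_mean([c for c in per_chromosome if c in AUTOSOMES]),
    }


# Made-up counts, for illustration only.
stats = aggregate_mean_coverage({
    "1": (100, 3000.0),
    "2": (100, 4000.0),
    "X": (50, 750.0),
})
```

Pooling sums and base counts, rather than averaging per-chromosome means, keeps the genome-wide figures weighted by chromosome size. The result is a separate flat entry rather than an extra level in the hierarchy.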

Support NonN regions bed file for the whole genome coverage analysis

When calculating the whole genome coverage metrics there are some regions in the genome that only contain unknown bases (Ns) in the reference genome and that add a bias towards underestimation.

There are two alternatives:

Provide these regions as a BED file
Retrieve this information from CellBase
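
The effect of restricting the calculation to non-N regions can be sketched on a toy per-base coverage vector. The function name is hypothetical; intervals are 0-based and half-open, as in BED, and are assumed to lie within the contig.

```python
def mean_coverage_in_regions(coverage, regions):
    """Mean depth over the given (start, end) intervals of one contig."""
    total = 0
    bases = 0
    for start, end in regions:
        total += sum(coverage[start:end])
        bases += end - start
    return total / float(bases) if bases else None


# Toy contig: the first four bases are Ns in the reference, so no read maps there.
coverage = [0, 0, 0, 0, 30, 30, 30, 30, 30, 30]
non_n_regions = [(4, 10)]
whole_contig = mean_coverage_in_regions(coverage, [(0, len(coverage))])
non_n_only = mean_coverage_in_regions(coverage, non_n_regions)
```

Here the whole-contig mean is 18.0 while the non-N mean is 30.0, which is the underestimation described above.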

Calculate standard deviation at several genomic levels

Add standard deviation to the set of statistics computed on coverage data.
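
One approach that works at several levels is to keep three running totals per region (number of bases, sum and sum of squares of the coverage) and derive the standard deviation from them, since the totals of sub-regions can simply be added up. This is a sketch of that approach with a hypothetical function name; it computes the population standard deviation and can lose precision when the variance is tiny relative to the mean.

```python
import math


def std_from_sums(n, total, total_sq):
    """Population standard deviation from count, sum and sum of squares."""
    mean = total / float(n)
    # Clamp at zero to absorb small negative values from rounding.
    variance = max(total_sq / float(n) - mean * mean, 0.0)
    return math.sqrt(variance)


depths = [2, 4, 4, 4, 5, 5, 7, 9]
sd = std_from_sums(len(depths), sum(depths), sum(d * d for d in depths))
```

For these depths the mean is 5 and the standard deviation is 2.0.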

Include a parametrizable number of bases up and downstream of each exon

We need to compute coverage statistics on exons including 15bp up and downstream.

Subtasks:

Parametrize the number of bases to cover in the config file
Implement a simple version not dealing with exon overlapping

Acceptance criteria:

The coverage statistics available include the added regions

Non goals:

Deal with overlapping regions when adding the padding
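
A sketch of the simple version described in the subtasks: each exon is extended independently, the start is clamped at zero, and overlaps created by the padding are left as they are. The function name and the tuple layout are hypothetical.

```python
def pad_exons(exons, padding=15):
    """Extend each (chrom, start, end) exon by `padding` bases on both sides.

    Overlapping intervals are not merged, matching the non-goal above.
    """
    return [(chrom, max(start - padding, 0), end + padding)
            for chrom, start, end in exons]


padded = pad_exons([("1", 100, 200), ("1", 5, 20)])
```

Clamping at the chromosome end would additionally need the chromosome lengths, which this sketch does not take as input.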
