Code Monkey home page Code Monkey logo

gel-coverage's People

Contributors

antonior26 avatar enricserra avatar javild avatar mattwellie avatar neilgoodgame avatar pfurio avatar priesgo avatar rsultana avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gel-coverage's Issues

Error when all transcripts are filtered out

When there is no gene having a transcript that passes the filters of flag and biotype, the coverage analysis fails with an error Incorrect BED file!.

To reproduce run:

BW=/genomes/analysis/by_date/2015-01-12/RAREP01846/LP2000717-DNA_A01/coverage/LP2000717-DNA_A01.bw 
PANEL=5763f2ea8f620350a1996048 
PANEL_VERSION=1.0

Stress test: run the analysis on a whole genome file

So far we have only tested panels from PanelApp and custom gene lists by preparing the files with the very specific regions required for the coverage analysis.
Run the whole exome mode on a whole genome bigwig file and record execution times.
Also run the other modes with whole genome files.

Parametrize the exclusion of genes from the input panel/s based on confidence

Genes in PanelApp are classified as HighConfidence, MediumConfidence and LowConfidence. Usually we only need those having HighConfidence.

Subtasks:

  • Add this parameter to the CLI
  • Implement filtering when querying PanelApp

Acceptance criteria:

  • Genes are properly filtered

Non goals:

  • Filter genes by any parameter not present in PanelApp

SystemError: Input gene list is not correct

When running:

sh ./run_coverage_analysis.sh /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw /genomes/analysis/by_date/2015-10-23/0000074602/LP2000860-DNA_B04/coverage/LP2000860-DNA_B04.bw_558c24a2bb5a166f63868678_1.0.json

We get :

558c24a2bb5a166f63868678 1.0 DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): bioinfo.hpc.cam.ac.uk DEBUG:requests.packages.urllib3.connectionpool:http://bioinfo.hpc.cam.ac.uk:80 "HEAD /cellbase HTTP/1.1" 302 0 DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 10.5.8.201 DEBUG:requests.packages.urllib3.connectionpool:http://10.5.8.201:8080 "HEAD /cellbase-4.5.0-rc HTTP/1.1" 302 0 DEBUG:root:Getting gene list from PanelApp... DEBUG:root:Gene list obtained from PanelApp of 0! INFO:root:Gene list to analyse: INFO:root:0 genes to analyse DEBUG:root:Creating the BigWig reader... DEBUG:root:BigWig reader created! INFO:root:Sanity checks on the configuration OK! INFO:root:Starting coverage analysis INFO:root:Building gene annotations bed file from CellBase... Traceback (most recent call last): File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 103, in <module> main() File "/genomes/software/apps/gel-coverage/scripts/bigwig_analyser", line 88, in main (results, bed) = gel_coverage_engine.run() File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/runner.py", line 419, in run bed = self.cellbase_helper.make_exons_bed(self.gene_list, has_chr_prefix=self.bigwig_reader.has_chr_prefix) File "/genomes/software/apps/python2.7-tiering/lib/python2.7/site-packages/gelcoverage/tools/cellbase_helper.py", line 169, in make_exons_bed raise SystemError("Input gene list is not correct") SystemError: Input gene list is not correct

On the 8th of February 2017

Support configuration file and create one example configuration file

The configuration file will read two type of parameters:

  • Usually static parameters that we don't want to leave at CLI level (e.g.: assembly)
  • CellBase and PanelApp parameters that will be eventually converted into the JSON format supported by clients.

Subtasks:

  1. Define the parameters in the configuration file
  2. Implement read and usage of configuration

Acceptance criteria:

  • Parameters in configuration file are agreed
  • Application uses the configuration file
  • There is an example config file in the repository

Non goals:

  • Replace all parameters in CLI

Cache reference genome BED obtained from CellBase

In order to improve performance and improve robustness towards possible CellBase server failures we want to cache the results obtained for each gene.

Subtasks:

  • Identify the caching unit (could be organism-assembly-gene)
  • Implement cache storage
  • Implement cache reading

Acceptance criteria:

  • Caching works at gene level. For instance if we cache a panel and then run a whole exome, genes in the panel will be read from the cache, while the rest will be queried into CellBase

Add GRCh38 support to wrapper script

coverage.sh should take an argument to indicate whether the genome is aligned to GRCh37 or GRCh38. Steps run by the script should then take this genome version into account.

It appears that coverage_summary.py takes an --assembly argument (see here)

Add whole genome metric of coverage uneveness

The uneveness of coverage is important for the quality control of cancer samples. This is a whole genome metric that is being obtained now by an independent script:

/genomes/software/apps/bwtool/bwtool summary 100000 $BW $BWTOOLS -header -with-sum-of-squares -with-quantiles
R --slave --args $BWTOOL < /genomes/scratch/asosinsky/dev/RMSDcov.R > ./tmp 2>&1
RMSD_COV=cat ./tmp| cut -f 1 -d ' '

Segmentation fault for GRCh38 coverage analysis

At initialization there is a log listing all the genes to be analysed. This log is only intended for panel or gene list analysis, but it is written also with other configurations. In GRCh37 after extracting from CellBase all the genes to be analysed, these are printed without problem, but this is not the case for GRCh38. Too long list to join items with comma I guess... so this causes a segmentation error in Python interpreter.

Filter transcripts to be used by basic flag (GeneCode genes) and also biotype

When obtaining the transcripts to analyse from a list of genes we want to filter the transcripts for each gene by two criteria: basic flag (GeneCode genes) and biotype (@Antonior26 you need to define this). This filtering should be configurable through a configuration file.

When the input is a list of transcripts the filtering will not take place.

Dependencies: #3

Acceptance criteria:

  • Transcripts are properly filtered

Coverage pipeline usage of different chromosome identifiers conventions

Chromosome identifiers might use a "chr" prefix or not. In the coverage pipeline we are using different conventions across the different modules.
BAMs are usually using the chromosomes without prefix (though some exceptions have been observed /genomes/by_date/2016-11-18/RAREP40001/LP2000274-DNA_B11/).
The module that converts a BAM into a coverage bigwig file is expecting a BAM with chromosomes not using the prefix, while the bigwig output contains chromosomes with prefix. The conversion from chromosomes without prefix to chromosomes with prefix is too simplistic and some contigs in the reference might be wrongly transformed (https://github.com/genomicsengland/gel-coverage/blob/master/bam2wig/get_chr_sizes.py).
The module that analyses the bigwig relies on the reference extracted from CellBase which uses strictly chromosomes without the prefix, but it has to deal with a bigwig using the prefix. The output statistics chromosomes don't have the prefix.

Parametrize analysis region by: panels, gene list or whole exome

The region on which to compute a coverage analysis is obtained from a list of genes. This list of genes might come from a panel from PanelApp, a list of genes or no list, meaning all of the available genes. Add support for the three types of input. The priority of parameters is as follows: panel, gene list and whole exome. Meaning for instance that if a panel and a gene list is provided, the panel will overrride the gene list.

Subtasks:

  • Add appropriate parameters
  • Implement the retrieval of genes from PanelApp
  • Implement the retrieval of all available genes from CellBase

Acceptance criteria:

  • Users can use any of the alternatives to provide a list of genes

Non goals:

  • Allow the user to provide a list of transcripts

Parametrize analysis region by: transcript list

The region to analyse might be defined by a set of transcripts as opposed to a set of genes. Even if conceptually is straightforward with the existing implementation based on gene list this it might not (further analysis required).

Subtasks:

  • Analyse how current BED generation approach can be adapted to a list of transcripts. Do we need another implementation of the BED creation?
  • Retrieve all the information for a list of transcripts from CellBase. Transcript end-point does not have search implemented (this might need clarification from @dapregi).

Acceptance criteria:

  • Analysis works on a list of transcripts

Automate test data download by making it available in OpenCGA

Given that the input files are bigwig this files are quite big even when subsetting very specific regions in the genome (over 100 MB). As we want to avoid uploading these "big" files to GitHub we would like to make those available through OpenCGA.
The implemented tests should take care of finding this files, downloading them and setting them in place to run the test.

Report not sequenced regions

Some regions in the genome may not be included in the bigwig. We need to detect those cases and report them in the JSON as unsequenced regions.

A coding region not included in the bigwig that was detected during testing (Test 3) which corresponds to /genomes/analysis/by_date/2016-09-27/HX01166477/CancerLP3000079-DNA_F03_NormalLP3000067-DNA_C12/coverage/LP3000079-DNA_F03.bw is chrGL000191.1:50009-50281
See: http://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000215699;r=GL000191.1:1-106433;t=ENST00000374457

Verified that the corresponding BAM does not include this "chromosome". Only includes 1-22, X, Y and MT.

Add statistics at chromosome level for whole genome and exome regions

We need to get aggregated metrics by chromosome on the whole genome and on the coding region (or otherwise panel or gene list if provided). This should be stored not as an additional level in the hierarchy, but as a separate entry in the results section.
Also, aggregation across all chromosomes and across autosomes is desired.

Support NonN regions bed file for the whole genome coverage analysis

When calculating the whole genome coverage metrics there are some regions in the genome that only contain unknown bases (Ns) in the reference genome and that add a bias towards underestimation.

There are two alternatives:

  1. Providing this regions as a bed file
  2. Retrieve this information from CellBase

Include a parametrizable number of bases up and downstream of each exon

We need to compute coverage statistics on exons including 15bp up and downstream.

Subtasks:

  • Parametrize the number of bases to cover in the config file
  • Implement a simple version not dealing with exon overlapping

Acceptance criteria:

  • The coverage statistics available include the added regions

Non goals:

  • Deal with overlapping regions when adding the padding

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.