
Comparative Annotation Toolkit (CAT)


This project aims to provide a straightforward end-to-end pipeline that takes as input a HAL-format multiple whole genome alignment and a GFF3 file representing annotations on one high quality assembly in the HAL alignment, and produces an output GFF3 annotation on all chosen target genomes.

This pipeline is capable of running both on local cluster hardware as well as on common cloud infrastructure using the toil workflow engine. For full runs on many genomes, a decent amount of computational effort is required. Memory usage is moderate.

[Figure: CAT pipeline flowchart]

Above is a flowchart schematic of the functionality of the CAT pipeline.

Installation

The pipeline can be installed by a simple pip install:

pip install git+https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit.git

However, at this time, direct pip installation will leave the luigi.cfg, logging.cfg, and test files buried in your python directory. I am still trying to figure out how to approach this problem. In the meantime, you may be better off cloning the repository and installing from your clone:

git clone https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit.git
pip install -e Comparative-Annotation-Toolkit

If you want to do the direct pip installation, you can grab the config files from the repository and place them in whatever directory you want to execute from, or set the LUIGI_CONFIG_PATH environment variable to point to their location. Or have an ugly log, your choice.
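For example, a minimal sketch of the environment-variable approach (the location below is an assumption -- point it at wherever you actually copied the files):

```shell
# Point luigi at your copy of luigi.cfg; keep logging.cfg next to it.
# "$PWD/luigi.cfg" is a placeholder path.
export LUIGI_CONFIG_PATH="$PWD/luigi.cfg"
```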

Either form of pip installation will install all of the python dependencies.

Dependencies (if not using Docker)

Below is a full breakdown of the required dependencies if you are not using Docker. Some of these can be challenging to get to all compile and work properly. In addition to the breakdown below, you may get guidance looking at the Dockerfile or looking at this list of installation commands generated by a helpful user.

If you are using the Docker container we provide, you don't need to worry about installing any of these. If you are not, the binary dependencies below must be compiled and installed in addition to the python dependencies.

  1. Kent toolkit. Follow the installation instructions there. Make sure you put the newly created ~/bin/$MACHTYPE directory on your path. All of the binaries required by CAT are available pre-compiled on the utilities page. The required tools are faToTwoBit gff3ToGenePred genePredToBed genePredToFakePsl bamToPsl transMapPslToGenePred pslPosTarget axtChain chainMergeSort pslMap pslRecalcMatch pslMapPostChain gtfToGenePred genePredToGtf pslCDnaFilter clusterGenes pslToBigPsl bedSort bedToBigBed wigToBigWig fasize.
  2. bedtools.
  3. samtools (1.3 or greater).
  4. Augustus. Make sure you are installing augustus >= 3.3.1. If you want to use AugustusCGP, install the development version from the Github repository. You need to follow the instructions to compile augustus in comparative augustus mode. This requires modifying a few lines in the common.mk file, and also requires sqlite3, lp-solve, bamtools, and libboost. If you are using Ubuntu, this should work: apt-get install libboost-all-dev sqlite3 libsqlite3-0 libsqlite3-dev libgsl0-dev lp-solve liblpsolve55-dev bamtools libbamtools-dev

After you have the primary augustus binaries compiled, add the directory to your path. Note that if you move the augustus binaries from their original location, you will need to set the AUGUSTUS_CONFIG_PATH environment variable to point to the config directory (the directory that contains the species subdirectory).
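A sketch of that environment setup, assuming augustus was built under $HOME/augustus (adjust the prefix to your build location):

```shell
# Add the augustus binaries to PATH and point AUGUSTUS_CONFIG_PATH at the
# config directory containing the species/ subdirectory.
export PATH="$HOME/augustus/bin:$PATH"
export AUGUSTUS_CONFIG_PATH="$HOME/augustus/config"
```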

You will also need to put the contents of the scripts directory on your path. Next, you need to compile the following auxiliary programs from the folder auxprogs:

  1. joingenes. Compiling this program will place it in the augustus binary directory.
  2. bam2hints. Compiling this program will place it in the augustus binary directory. Requires bamtools to be installed. If the bamtools headers are not at /usr/include/bamtools, you will need to modify the makefile.
  3. filterBam. Also requires the bamtools headers.
  4. bam2wig. Compiling this program will NOT place it in the augustus binary directory, you must do so yourself. This program requires you modify the makefile to explicitly point to your installation of htslib, bcftools, samtools, and tabix. Tabix is now packaged with htslib, and both are included in your kent directory at $kent/src/htslib/.
  5. homGeneMapping. This program must also have its makefile at $augustus/trunks/auxprogs/homGeneMapping/src/Makefile modified to turn on the BOOST = true and SQLITE = true flags. Then run make clean && make to recompile.
  6. There are a series of perl scripts that you need to place on your path from the $augustus/trunks/scripts directory: wig2hints.pl, exonerate2hints.pl, transMap2hints.pl, and join_mult_hints.pl.
  7. HAL toolkit. To install the HAL toolkit, you must also have the sonLib repository in the same parent directory. Compile sonLib first, then compile hal. Once hal is compiled, you need to have the binaries on your path.
  8. wiggletools. Used to combine RNA-seq expression in assembly hubs.
  9. sambamba. Used to name sort faster than samtools for hints building.
  10. exonerate. Used for protein alignments in hints building.

Conda/bioconda

Many of the above dependencies are on conda. However, you will still need to install the following things by hand:

  1. clusterGenes: The version on conda is too old.
  2. toil: The version on conda is too old (not python3 compatible).
  3. cat: You will need to install this repo with pip install after you enter the conda env (conda activate cattest).
  4. HAL: Not available on conda.

conda create -y -n cattest -c conda-forge -c bioconda -c defaults python=3.7 pyfasta luigi seaborn pandas \
ete3 pysam numpy scipy bx-python bcbio-gff biopython parasail-python configobj sqlalchemy \
samtools bamtools augustus exonerate wiggletools bedtools \
ucsc-fatotwobit ucsc-gff3togenepred ucsc-genepredtobed ucsc-genepredtofakepsl ucsc-bamtopsl ucsc-transmappsltogenepred \
ucsc-pslpostarget ucsc-axtchain ucsc-chainmergesort ucsc-pslmap ucsc-pslrecalcmatch ucsc-pslmappostchain \
ucsc-gtftogenepred ucsc-genepredtogtf ucsc-pslcdnafilter ucsc-psltobigpsl \
ucsc-bedsort ucsc-bedtobigbed ucsc-fasize ucsc-wigtobigwig ucsc-hgloadchain

In total, you must have all of the binaries and scripts listed below on your path. The pipeline will check for them before executing steps. hal2fasta halStats halLiftover exonerate faToTwoBit pyfasta gff3ToGenePred genePredToBed genePredToFakePsl bamToPsl exonerate2hints.pl blat2hints.pl transMapPslToGenePred join_mult_hints.pl pslPosTarget axtChain chainMergeSort pslMap pslRecalcMatch pslMapPostChain augustus transMap2hints.pl joingenes hal2maf gtfToGenePred genePredToGtf bedtools homGeneMapping pslCDnaFilter clusterGenes pslToBigPsl bedSort bedToBigBed sambamba wig2hints.pl pal2nal.pl
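That check can also be scripted ahead of time; a minimal sketch (the tool list here is abbreviated -- extend it with the full list above):

```shell
# Print each required tool that is not on PATH.
check_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done
}
# Abbreviated list; extend with the full set from the documentation.
check_tools hal2fasta halLiftover faToTwoBit augustus bedtools samtools
```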

Running the pipeline

This pipeline makes use of Luigi to link the various steps together. First, start the luigid daemon:

luigid --background --logdir luigi_logs

This provides the central scheduler as well as the web UI, which can be accessed at localhost:8082. If you don't want to use the daemon (using it is highly recommended), add the flag --local-scheduler to the invocation.

To run the test data, change directories to the CAT installation folder and do the following:

luigi --module cat RunCat --hal=test_data/vertebrates.hal --ref-genome=mm10 --workers=10 --config=test_data/test.config --work-dir test_install --out-dir test_install --local-scheduler --augustus --augustus-cgp --augustus-pb --assembly-hub > log.txt

The test should take around 30 minutes to execute. You can track progress in the log file.

GFF3 input

CAT requires valid GFF3 as input. The script programs/validate_gff3 can test that your GFF3 is both valid and meets the requirements of CAT. These requirements include:

The following keys are reserved, and have special meaning:

reserved_keys = ['gene_biotype',
                 'transcript_biotype',
                 'gene_name',
                 'gene_id',
                 'transcript_id',
                 'transcript_name',
                 'ID',
                 'Name',
                 'Parent']

The keys ID, Name and Parent are required for valid GFF3 and define the hierarchical relationship. The remaining keys, gene_biotype, transcript_biotype, gene_name, gene_id, transcript_id and transcript_name are also all required. In many cases you will not have common names, and so it is fine for transcript_name to equal transcript_id and for gene_name to equal gene_id. The biotypes can be whatever you want, but protein_coding is a special biotype that tells CAT this gene or transcript is coding. You must flag your protein coding genes and transcripts as protein_coding!

You may have any arbitrary set of keys and values in the GFF3 that are not the reserved keys. These keys and values will be propagated on to the resulting transcript in the CAT output GFF3.
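For illustration, a minimal gene record carrying all of the required keys (coordinates, the source column, and all identifiers are placeholders, not taken from any real annotation):

```
chr1	example	gene	1000	9000	.	+	.	ID=gene1;Name=GeneA;gene_id=gene1;gene_name=GeneA;gene_biotype=protein_coding
chr1	example	mRNA	1000	9000	.	+	.	ID=tx1;Parent=gene1;Name=GeneA-201;gene_id=gene1;gene_name=GeneA;gene_biotype=protein_coding;transcript_id=tx1;transcript_name=GeneA-201;transcript_biotype=protein_coding
chr1	example	exon	1000	9000	.	+	.	Parent=tx1
chr1	example	CDS	1000	9000	.	+	0	Parent=tx1
```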

If your GFF3 has duplicate transcript names, the pipeline will complain. One common cause of this is PAR locus genes. You will want to remove PAR genes -- if your GFF3 came from GENCODE, you should be able to do this: grep -v PAR_Y $gff > $gff.fixed.

Your GFF3 file should not have any gene_id identifiers split across multiple chromosomes. Ideally, your GFF3 also has no disjoint genes (genes whose transcripts do not overlap). The script validate_gff3 will warn about such genes, but it will not raise an error. The script fix_chrom_disjoint_genes will rename gene_id values that are shared on multiple chromosomes with a unique suffix.

Command line options

As described above, the primary way to execute the pipeline is the invocation luigi --module cat RunCat --hal=${halfile} --ref-genome=${ref-genome} --config=${config}. Below are the flags that can modify execution and output.

Main options

--hal: Input HAL alignment file. (REQUIRED).

--ref-genome: Reference genome sequence name. Must be present in HAL. (REQUIRED).

--config: Path to the config file with annotations and extrinsic hints. See the config section for more information. (REQUIRED).

--binary-mode: How should CAT run its binary dependencies? Valid choices are "docker" (the default, in which case you must have a Docker daemon running) or "local" (in which case ensure your PATH contains the necessary dependencies).

--out-dir: Output directory. Defaults to ./cat_output.

--work-dir: Working directory. Defaults to ./cat_work. Stores all the intermediate files as well as the toil jobStore. Can be removed after completion (but not if you want to re-do any steps).

--target-genomes: List of genomes to use. If not set, all non-reference genomes in the HAL are used. Due to how luigi handles command line tuple parameters, this flag must be formatted as if it was a tuple being passed directly to python, single quoted. So, for example, if your target genomes were Human and Mouse, then you would pass --target-genomes='("Human", "Mouse")'. As always with python tuples, if you have only one member, you must have a trailing comma.

--workers: Number of local cores to use. If running toil in single_machine mode, care must be taken with the balance of this value and the --maxCores parameter.

Augustus config options

The augustus config files for all of the modes live in the CAT folder under augustus_cfgs. If you are running CAT from a folder that is not the installation folder, you will need to point CAT to these files directly.

--tm-cfg: Config file for AugustusTM. Defaults to augustus_cfgs/extrinsic.ETM1.cfg.

--tmr-cfg: Config file for AugustusTMR. Defaults to augustus_cfgs/extrinsic.ETM2.cfg.

--augustus-cgp-cfg-template: Config file template for AugustusCGP. Defaults to augustus_cfgs/cgp_extrinsic_template.cfg.

--pb-cfg: Config file for AugustusPB. Defaults to augustus_cfgs/extrinsic.M.RM.PB.E.W.cfg.

transMap options

--global-near-best: Adjusts the globalNearBest parameter passed to pslCDnaFilter. Defaults to 0.15. The globalNearBest algorithm determines which set of alignments are within a certain distance of the highest scoring alignment for a given source transcript. Making this value smaller will increase the number of alignments filtered out, decreasing the apparent paralogous alignment rate. Alignments which survive this filter are putatively paralogous.

AugustusTM(R) options

--augustus: Run AugustusTM(R)?

--augustus-species: What Augustus species do we want to use? If your species is not a mammal, please choose one of the species listed here.

--augustus-utr-off: AugustusTMR will crash trying to predict UTRs if your --augustus-species lacks a trained UTR model. You can check whether $augustusDir/config/species/$augustusSpecies/$augustusSpecies_utr_probs.pbl exists. If it does not, set this flag.
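A sketch of that check, assuming $AUGUSTUS_CONFIG_PATH is set and using human as a stand-in species name:

```shell
species=human   # placeholder: substitute your --augustus-species value
utr_model="$AUGUSTUS_CONFIG_PATH/species/$species/${species}_utr_probs.pbl"
if [ -e "$utr_model" ]; then
    echo "UTR model found; UTR prediction should work"
else
    echo "no UTR model found; pass --augustus-utr-off"
fi
```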

AugustusCGP options

--augustus-cgp: Run AugustusCGP?

--cgp-param: Parameters file after training CGP on the alignment. See the AugustusCGP section.

--maf-chunksize: Size to chunk HAL into. Larger values make the CGP jobs take longer, but reduce problems related to splitting in genic regions. Default is 2500000. If your HAL contains more than 10 or so genomes, reducing this value to 1000000 or so is a good idea to keep job run-times below an hour and avoid going over 8GB of RAM per job. For a 25-way alignment, I set this value to 750000.

--maf-overlap: How much overlap to use in HAL chunks. Larger values increase redundant predictions (which are merged). Default is 500000. For a 25-way alignment, I set this value to 150000.

--cgp-train-num-exons: Number of exons to require in the alignment subset used for training CGP. See the AugustusCGP section. Default is 5000.

AugustusPB options

--augustus-pb: Run AugustusPB? Will only run on genomes with IsoSeq data in the config file.

--pb-genome-chunksize: Size to chunk genome into. Default is 20000000.

--pb-genome-overlap: How much overlap to use in genome chunks. Default is 500000.

Filtering and consensus finding options

--filter-overlapping-genes: Should genes that get flagged as overlapping be removed? After consensus finding is finished, instances of gene family collapse or paralog mis-assignment may lead to overlapping CDS intervals on different genes. In some instances this may also be a feature of the original annotation set. However, some annotation databases do not allow this, so this flag will remove all such instances and resolve them down to one gene.

--overlapping-ignore-bases: This flag is passed to the tool clusterGenes as the ignoreBases parameter. This value represents the number of bases on the 3' and 5' of a transcript to ignore when generating gene clusters. Setting this value higher than 0 is recommended for smaller genomes that likely have real overlapping genes, otherwise overlapping genes will be incorrectly collapsed. Default is 0.

--intron-rnaseq-support: Amount of RNA-seq intron support a transcript must have to be considered. Must be a value between 0 and 100. Default is 0.

--exon-rnaseq-support: Amount of RNA-seq exon support a transcript must have to be considered. Must be a value between 0 and 100. Default is 0.

--intron-annot-support: Amount of reference intron annotation support a transcript must have to be considered. Must be a value between 0 and 100. Default is 0.

--exon-annot-support: Amount of reference exon annotation support a transcript must have to be considered. Must be a value between 0 and 100. Default is 0.

--original-intron-support: Amount of original intron support (see the transcript evaluation description of original introns) a transcript must have to be considered. Must be a value between 0 and 100. Default is 0.

--denovo-num-introns: For de-novo predictions, discard any transcripts with fewer than this number of introns. Important when RNA-seq data are noisy. Default is 0.

--denovo-splice-support: For de-novo predictions, discard any transcripts with less than this percent of RNA-seq intron support. Must be a value between 0 and 100. Default is 0.

--denovo-exon-support: For de-novo predictions, discard any transcripts with less than this percent of RNA-seq exon support. Must be a value between 0 and 100. Default is 0.

--denovo-ignore-novel-genes: For de-novo predictions, discard any transcripts that are predicted to be novel genes. In other words, only retain putative novel isoforms.

--denovo-allow-novel-ends: For de-novo predictions not derived from augCGP, do we allow for novel 5' or 3' ends to be sufficient to be considered a novel isoform?

--denovo-novel-end-distance: For de-novo predictions, allow transcripts to be included if they provide a novel 5' or 3' end N distance away from any existing ends. This flag only applies if --denovo-allow-novel-ends is set. Default is 0.

--denovo-allow-unsupported: For de-novo predictions, allow novel isoforms to be called if they contain splices that are not supported by the reference annotation even if they are also not supported by RNA-seq. Without this flag, novel isoforms will only be called if they have one or more splice that has RNA-seq/IsoSeq support and no reference annotation support.

--denovo-allow-bad-annot-or-tm: For de-novo predictions, allow novel isoforms to be called that were flagged as BadAnnotOrTm. These predictions overlap instances where multiple genes transMapped to the same location with significant overlap, and so may be alignment mistakes, collapsed repeats or gene family collapse.

--require-pacbio-support: If set, all isoforms in the final set must be supported by at least one IsoSeq read. This flag is likely to discard a ton of transcripts, so be careful.

--in-species-rna-support-only: If set, all of the above intron/exon support flags will look only at RNA-seq/IsoSeq data from the species in question, and not make use of homGeneMapping to check support in all species. The output plots will always report in-species support.

--rebuild-consensus: A convenience flag to allow you to adjust the flags above. When set, will force the pipeline to re-run consensus finding and will also re-build the downstream plots and assembly hub.

Assembly hub

--assembly-hub: Build an assembly hub? Default is false. Assembly hubs allow you to view your alignments and annotation on the UCSC browser.

--hub-email: Optionally, add an email to your assembly hub. Useful if you are planning on publishing the hub.

--hub-name: Optionally, change the default name of your assembly hub. Useful if you are planning on publishing the hub.

See below for toil options shared with the hints database pipeline.

Toil

The remaining options are passed directly along to toil:

--batchSystem: Batch system to use. Defaults to single_machine. If running in single_machine mode, no cluster jobs will be submitted. In addition, care must be taken to balance the --maxCores field with the --workers field and the toil resources in luigi.cfg. Basically, you want to make sure that your number of toil resources multiplied by --maxCores does not exceed the total number of system cores you want to use. However, I highly recommend using a non-local batch system. See the toil documentation for more.

--maxCores: The number of cores each toil module will use. If submitting to a batch system, this limits the number of concurrent submissions.

Autoscale

To use the autoscale functionality, change the resources to toil = 1 in luigi.cfg.
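Assuming CAT's luigi.cfg follows luigi's standard [resources] section, that change looks like:

```
[resources]
toil = 1
```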

Follow these instructions to setup credentials for AWS and the toil leader node for the cluster.

--zone: AWS region to run on.

--nodeTypes: AWS instance type for the worker nodes.

--provisioner: The provisioner of the cloud system. For AWS it's just aws.

--maxNodes: Maximum number of nodes running at once.

Config file

The config file contains the paths to two important pieces of information -- the reference GFF3 and the extrinsic hints (bams).

A major component of producing high quality comparative annotations is making use of RNA-seq and/or IsoSeq information. This information is used as hints to the augustus gene finding tool along with transMap, and is a major component of cleaning up transcript projections. This is also useful if you run the augustusCGP or augustusPB portions of the pipeline.

If the genetic distances in your alignment are high (say maybe an average identity in the 70s-80s), then you may derive great benefit from using a protein reference, if possible. This will be particularly useful for augustusCGP.

A template for the config file is below. At a minimum, your config file must have the annotation section. An example config file is provided in the test_data folder.

BAM files must be indexed!

[ANNOTATION]
Genome = /path/to/reference/gff3

[BAM]
Genome = /path/to/fofn <OR> /path/to/bam1.bam, /path/to/bam2.bam

[INTRONBAM]
Genome = /path/to/fofn/of/noisy/rnaseq

[ISO_SEQ_BAM]
Genome = /path/to/isoseq/bams

[PROTEIN_FASTA]
Genome = /path/to/protein/fasta

Note that the BAM/INTRONBAM/ISO_SEQ_BAM fields can be populated either with a comma separated list of BAMs or a single file with a line pointing to each BAM (a FOFN, or file-of-file-names). The reference sequence information will be extracted from the HAL alignment.

For the PROTEIN_FASTA field, every genome you wish to have the protein fasta be aligned to must be on its own separate line. All of these can point to the same FASTA.
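For example, with two hypothetical target genomes sharing one protein set:

```
[PROTEIN_FASTA]
GenomeA = /path/to/protein/fasta
GenomeB = /path/to/protein/fasta
```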

RNA-seq libraries

It is extremely important that you use high quality RNA-seq. Libraries should be poly-A selected and paired end with a minimum read length of 75bp. If any of these are not true, it is advisable to place these libraries in the INTRONBAM field. Any genome can have a mix of BAM and INTRONBAM hints.

BAM files must be genomic-coordinate sorted and indexed!

IsoSeq libraries

If you are using IsoSeq data, it is recommended that you do your mapping with minimap2. These BAM files must also be genomic-coordinate sorted and indexed.
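A sketch of that mapping step, wrapped in a helper function; the filenames are placeholders, and splice:hq is minimap2's documented preset for high-quality long reads such as Iso-Seq:

```shell
# Map Iso-Seq reads, coordinate-sort, and index the result.
map_isoseq() {
    genome=$1; reads=$2; out=$3
    minimap2 -ax splice:hq -uf "$genome" "$reads" \
        | samtools sort -o "$out" -
    samtools index "$out"
}
# Usage (placeholder files): map_isoseq genome.fa isoseq.fq isoseq.sorted.bam
```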

Chain mode

If you do not have a HAL alignment file as input, but rather have .chain files that follow the format for UCSC genome browser chain files (https://genome.ucsc.edu/goldenPath/help/chain.html), you can run CAT using the following flags and modifications to the config file.

--chain_mode: Run CAT in chain mode. You no longer need to provide a HAL alignment file.

Add the following CHAIN and FASTA fields to the config file. You will need to provide paths to the chain file between the reference genome and target genomes, as well as paths to all of the fasta files for the reference genome and target genomes.

[CHAIN]
Genome = /path/to/chain

[FASTA]
Genome = /path/to/genome/fasta

Execution modes

The default mode of this pipeline will perform the following tasks:

  1. Lift all annotations present in the input GFF3 to all other genomes in the alignment.
  2. Filter these comparative transcripts for paralogous mappings.
  3. Evaluate these transcripts for potential problems, assigning a score.
  4. Produce an output annotation set as well as a series of plots charting how this process went.

These steps will run reasonably fast on one machine without any need for cluster computing. However, to construct a high quality annotation set, it is recommended that the pipeline be run with as many modes of AUGUSTUS as possible.

AugustusTM(R)

This parameterization of AUGUSTUS is primarily a method to clean up transMap projections. Due to a combination of assembly error, alignment noise, and real biological changes, transMap projections have frame-shifting indels, missing or incomplete exons, and invalid splice sites. AugustusTM is given every protein-coding transMap projection one at a time with some flanking sequence and asked to construct a transcript that closely matches the intron-exon structure that transMap provides. Since AUGUSTUS enforces a standard gene model, frame shifts and invalid splices will be adjusted to a valid form. In some cases this will mangle the transcript, producing either another isoform or something that does not resemble the source transcript. AugustusTMR runs the same genomic interval and transMap-derived hints through AUGUSTUS a second time, but with less strict weights on the transMap hints and with the addition of extrinsic hints from RNA-seq and/or IsoSeq. This is particularly useful in regions where an exon was dropped in the Cactus alignment.

AugustusTM and AugustusTMR can be run by providing the --augustus flag to the pipeline. AugustusTMR will only be run for genomes with extrinsic information in the hints database. If you are running CAT on a non-mammal, you will want to change the --augustus-species flag to one of the species listed here. Take care to check whether your species has a UTR model, and adjust the --augustus-utr-off flag accordingly.

AugustusCGP

AugustusCGP is the comparative mode of AUGUSTUS recently introduced by Stefanie Nachtweide. This mode of AUGUSTUS takes as input a HAL-format multiple whole genome alignment and simultaneously produces ab initio transcript predictions in all genomes, taking into account conservation as well as any extrinsic information provided. AugustusCGP allows for the introduction of novel isoforms and loci in the final gene sets.

AugustusCGP can be run by providing the --augustus-cgp flag to the pipeline. If no previously trained model is provided to AugustusCGP via the --cgp-param flag, the pipeline will automatically train the model on the given alignment. To do so, random subsets of the alignment are extracted until --cgp-train-num-exons exons are included. In practice, for vertebrate genomes, a few thousand exons corresponding to a few megabases of sequence are sufficient. If your genomes are more gene-dense, this may vary. The trained model will be written to the AugustusCGP working directory, and can be used again on alignments with similar genomes.

AugustusPB

AugustusPB is a parameterization of AUGUSTUS to try and predict alternative isoforms using long range data. If any IsoSeq data are provided in the config file, and the --augustus-pb flag is set, the genomes with IsoSeq data will be run through and the results incorporated in the final gene set. AugustusPB runs on single whole genomes.

Modules

While the primary mode of operation of the pipeline is to launch the RunCat module, you may want to run specific modules. Any submodule can be run by changing the luigi invocation to specify a submodule class instead.

PrepareFiles

This module parses the GFF3 annotation input, creating a genePred format file as well as a sqlite database. In addition, sequence files for all target genomes are extracted and converted to 2bit.

This module will populate the folders --work-dir/reference and --work-dir/genome_files.

Chaining

This step is the first precursor step to transMap. Pairwise genomic Kent-style chains are produced for each target genome from the designated reference. This step uses Toil and can be parallelized on a cluster.

This module will populate the folder --work-dir/chaining.

TransMap

This step runs transMap. The chain files are used to project annotations present in the GFF3 from the reference genome to each target genome.

EvaluateTransMap

This step performs the preliminary classification of transMap transcripts. This step populates the TransMapEvaluation table in the sqlite database for each target genome with the following classifiers:

  1. AlnExtendsOffConfig: Does this alignment run off the end of a contig?
  2. AlignmentPartialMap: Did this transcript not map completely?
  3. AlnAbutsUnknownBases: Does this alignment have Ns immediately touching any exons?
  4. PercentN: Percent of N bases in the alignment.
  5. TransMapCoverage.
  6. TransMapIdentity.
  7. TransMapGoodness: A measure of alignment quality that takes into account both coverage and alignment size in the target. Related to Jim Kent's badness score.
  8. TransMapOriginalIntronsPercent: The percent of transMap introns within a wiggle distance of an intron in the parent transcript, in transcript coordinates.
  9. Synteny: Counts the number of genes in linear order that match up to +/- 5 genes.
  10. ValidStart -- start with ATG?
  11. ValidStop -- valid stop codon (in frame)?
  12. ProperOrf -- is the orf a multiple of 3?

This module will populate the folder --work-dir/transMap.

FilterTransMap

This module relies on the globalNearBest algorithm in pslCDnaFilter to resolve paralogies followed by using clusterGenes to resolve gene family collapse and overlapping loci.

This process has 6 steps:

  1. Filter out all projections whose genomic span is more than 5 times the length of the original transcript. This hard-coded filter deals with the possibility of rearrangements leading to massive transMap projections. It is also required for the minSpan filter in pslCDnaFilter to work properly, as --minSpan is an effective filter against retroposed pseudogenes.
  2. Run pslCDnaFilter using the globalNearBest algorithm to identify the best set of alignments. Turning this value to a smaller number increases the number of alignments filtered out, which decreases the paralogous alignment call rate.
  3. Separate coding and non-coding genes and run both through clusterGenes with or without the -cds flag.
  4. For each gene ID in #2, see if it hits more than one cluster. Pick the highest scoring cluster. This resolves paralogy to ostensible 1-1 orthologs. This populates the GeneAlternateLoci tag.
  5. For each cluster ID in #2 that remains after #3, see if it hits more than one gene. If so, then we have a putative gene family collapse. Pick the highest average scoring gene and discard the other genes, populating the CollapsedGeneIds and CollapsedGeneNames tags.
  6. Perform a rescue step where transMaps that were filtered out by paralog resolution but overlap a valid cluster are re-added to the set despite not being globalNearBest.

After these steps, the transcripts are evaluated for split genes. This process takes the max span filtered set and looks at each transcript separately, seeing if there exists projections on either the same contig or different contigs that are disjoint in original transcript coordinates. This implies that there was a split or a rearrangement.

This module will further populate the folder --work-dir/transMap.

Augustus

As discussed above, this module runs AugustusTM(R). If the pipeline is run without a hints database, only the AugustusTM mode will be executed. This process is one of the most computationally intensive steps, and should not be run without a cluster.

This module will populate the folder --work-dir/augustus.

AugustusCgp

Running AugustusCGP is trickier than other modes. If your genomes are not closely related to an existing training set, you may need to perform logistic regression to train AugustusCGP before execution. A default parameter set is provided. This mode is also computationally intensive, and requires a cluster.

Each output transcript is assigned a parental gene, if possible. Parental gene assignment is done by checking whether the transcript has at least one exonic base of overlap with any filtered transMap as well as unfiltered transMap. If the transcript overlaps more than one gene, the Jaccard metric is used to try to resolve the ambiguity. If no gene stands out, the transcript is discarded. A sqlite table will record both the filtered and unfiltered overlaps.

Transcripts which are not assigned a parental gene will be considered novel in the consensus finding step. Most often, these are the result of gene family expansion or contraction in the reference. Looking at the raw transMap track in the final assembly hub will help resolve this.

This module will populate the folder --work-dir/augustus_cgp.

AugustusPb

Running AugustusPB requires that IsoSeq data be provided. This mode runs on single genomes, and attempts to discover new isoforms. Transcripts predicted in this process undergo the same parental gene assignment described above.

This module will populate the folder --work-dir/augustus_pb.

Hgm

homGeneMapping is a companion tool of AugustusCGP. This tool uses a HAL alignment to project RNA-seq and annotation information to target genomes. This is used to validate a splice junction in a target genome as being supported in one or more alternative genomes, as well as being supported in the reference annotation. This module populates the *_Hgm database table, where * is one of transMap, augTM, augTMR, augCGP or augPB depending on the transcripts being evaluated. This table has the following comma separated columns:

  1. AllSpeciesIntronRnaSupport. The number of species with RNA-seq data supporting the intron junctions, in genomic order.
  2. AllSpeciesExonRnaSupport. The number of species with RNA-seq data supporting the exons, in genomic order.
  3. IntronRnaSupport. Same as #1, but only within this species.
  4. ExonRnaSupport. Same as #2, but only within this species.
  5. IntronAnnotSupport. A bit vector indicating if the intron junctions are supported by the reference annotation.
  6. CdsAnnotSupport. A bit vector indicating if the CDS features are supported by the reference annotation.
  7. ExonAnnotSupport. A bit vector indicating if the exon features are supported by the reference annotation.

This module will populate the folder --work-dir/hgm.

The output of the homGeneMapping module embeds more information in the output files. Each GTF-format file in the above folder has an added column at the end with a string like:

"0E-6273,1E-1524,2N:M*-1,3E-742,4E-1912,5E-1208"

This can be interpreted as: species 0 had 6273 extrinsic hints (RNA-seq coverage), species 1 had 1524 extrinsic hints, species 2 (the reference) had both a non-coding (N) and a coding (M) junction, and so on. The species numeric values are listed at the top of the file, and correspond to the species IDs assigned internally in the hints database. These data can be useful if you want to dig into a specific annotation.
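As a rough illustration, the support string above can be split into per-species entries like this (the full homGeneMapping grammar is richer; this sketch only separates the species number from its support codes):

```python
import re

# A minimal parser for the per-feature support string shown above. The full
# homGeneMapping grammar is richer; this sketch only separates the species
# number from its support codes.

def parse_support(field):
    entries = {}
    for token in field.strip('"').split(','):
        m = re.match(r'(\d+)(.*)', token)
        if m:
            entries[int(m.group(1))] = m.group(2)
    return entries

support = parse_support('"0E-6273,1E-1524,2N:M*-1,3E-742,4E-1912,5E-1208"')
print(support[0])  # E-6273 -> species 0: extrinsic hints with multiplicity 6273
print(support[2])  # N:M*-1 -> species 2: non-coding (N) and coding (M) support
```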

AlignTranscripts

Transcript alignment allows AugustusTM(R) transcripts to be compared to their parental transMap. As a result, only protein-coding transcripts are aligned. For each transcript, alignment is performed by parasail two ways -- CDS alignment, and mRNA alignment. The results of these alignments are saved in the folder --work-dir/transcript_alignment. These alignments are used to create functional annotations of transcripts in the EvaluateTranscripts module.

EvaluateTranscripts

A series of classifiers that evaluate transcript pairwise alignments for transMap and AugustusTM(R) output.

These classifiers are broken down into 2 groups, which will each end up as a table in the database:

<alnMode>_<txMode>_Metrics:

These classifiers are per-transcript evaluations based on both the transcript alignment and the genome context.

  1. PercentUnknownBases: % of mRNA bases that are Ns.
  2. AlnCoverage: Alignment coverage in transcript space.
  3. AlnIdentity: Alignment identity in transcript space.
  4. OriginalIntrons: A bit vector evaluating whether each intron junction in transcript coordinate space is within 5 bases in either direction of an original junction. This is a powerful approach to identifying retroposed pseudogenes or problematic alignments.
  5. ValidStart: Does the transcript start with ATG?
  6. ValidStop: Does it have a valid (in-frame) stop codon?
  7. ProperOrf: Is the ORF a multiple of 3?
  8. AdjStart: The position of the new thickStart, taking frame-shifts into account (genomic coordinates, + strand).
  9. AdjStop: The position of the new thickStop, taking frame-shifts into account (genomic coordinates, + strand).
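The OriginalIntrons idea can be sketched as a simple fuzzy junction comparison; the 5-base window comes from the description above, while the function and argument names are hypothetical:

```python
# Sketch of the OriginalIntrons evaluation: each projected intron junction is
# compared, in transcript coordinates, to the source transcript's junctions and
# scored 1 if it lies within 5 bases of one. Names here are hypothetical.

def original_introns(projected_junctions, source_junctions, fuzz=5):
    """Return a bit vector: 1 if a projected junction is within `fuzz` bases
    of any source junction, else 0."""
    return [int(any(abs(p - s) <= fuzz for s in source_junctions))
            for p in projected_junctions]

print(original_introns([100, 250], [99, 251]))  # [1, 1] -- junctions preserved
print(original_introns([400], [99, 251]))       # [0]    -- junction not in source
```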

<alnMode>_<txMode>_Evaluation:

These classifiers are per-transcript evaluations based on the transcript alignment. Unlike the Metrics tables, this table stores the actual locations of the problems (in genome coordinates) in a BED-like format. In cases where there are multiple problems, they are stored as additional rows.

  1. CodingInsertion: Do we have any frame-shifting coding insertions?
  2. CodingDeletion: Do we have any frame-shifting coding deletions?
  3. CodingMult3Insertion: Do we have any mod3 coding insertions?
  4. CodingMult3Deletion: Do we have any mod3 coding deletions?
  5. NonCodingInsertion: Do we have any insertions in UTR sequence?
  6. NonCodingDeletion: Do we have any deletions in UTR sequence?
  7. InFrameStop: Are there any in-frame stop codons?

Where txMode is one of transMap, augTM, augTMR and alnMode is one of CDS or mRNA.
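The frame-shift distinction that separates CodingInsertion/CodingDeletion from their Mult3 counterparts reduces to a length-mod-3 check, sketched here (illustrative only, not CAT's actual classifier code):

```python
# The Mult3 distinction above reduces to a length-mod-3 check: indels whose
# length is a multiple of 3 preserve the reading frame. Illustrative only,
# not CAT's actual classifier code.

def classify_coding_indel(length, is_insertion):
    kind = "Insertion" if is_insertion else "Deletion"
    if length % 3 == 0:
        return "CodingMult3" + kind  # frame-preserving
    return "Coding" + kind           # frame-shifting

print(classify_coding_indel(3, True))   # CodingMult3Insertion
print(classify_coding_indel(2, False))  # CodingDeletion
```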

The evaluation tables will be loaded as tracks in the final assembly hub.

Consensus

The consensus finding process takes in transcripts from every mode and attempts to find the highest quality ortholog for a source transcript. The de-novo transcript modes are also evaluated for providing novel isoforms or novel loci. The final gene set is output with a series of features measuring how confident the prediction is.

To evaluate transMap, AugustusTM, and AugustusTMR transcripts, a consensus score is assigned to each. This score is the sum of the alignment goodness, intron/exon annotation support, original intron support, and intron/exon RNA-seq/IsoSeq support if extrinsic data were provided. The transcript with the highest consensus score is chosen.
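A minimal sketch of this scoring scheme, assuming a plain unweighted sum of the listed terms (the actual terms and any weighting are CAT internals; every name and number here is a placeholder):

```python
# A minimal sketch of the consensus scoring described above, assuming a plain
# unweighted sum of the listed terms. The actual terms and any weighting are
# CAT internals; every name and number here is a placeholder.

def consensus_score(goodness, annot_support, original_introns, rna_support=None):
    score = goodness + annot_support + original_introns
    if rna_support is not None:  # only when extrinsic data were provided
        score += rna_support
    return score

candidates = {"transMap": consensus_score(0.95, 0.9, 1.0),
              "augTMR": consensus_score(0.97, 0.9, 1.0, rna_support=0.8)}
best = max(candidates, key=candidates.get)
print(best)  # augTMR -- the transcript with the highest score is chosen
```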

If one of the de-novo augustus modes is run, then those transcripts are evaluated for providing novel information. If a prediction did not overlap any transMap projections, then it is tagged as putative novel and incorporated into the gene set. If a prediction overlaps a transMap projection that was filtered out during paralog resolution, then it is tagged as a possible paralog as well as with the names of overlapping transcripts and incorporated into the gene set. If a prediction overlaps a transMap projection and contains a splice junction not seen in the reference annotation, then it is tagged as a novel isoform and incorporated into the gene set as a member of the gene it overlapped with.

After consensus finding is complete, a final filtering process deduplicates the transcript set. Duplicates most often occur when the augustus execution modes create an identical transcript model from different input isoforms; in this case, the duplicates are removed and the remaining transcript is tagged with the names of the alternative source transcripts. Strand resolution discards transcripts that are on the opposite strand from the rest of their gene; the correct strand is chosen as the one containing the most high-quality transcripts. Finally, the transcripts are clustered again using clusterGenes on CDS intervals to resolve cases where incorporating novel predictions leads to different gene IDs sharing CDS bases.

After consensus finding, a final output gene set is produced in both GFF3 and genePred format. The genePred annotations also have an additional .gp_info file with the additional fields described below.

The output will appear in --output-dir/consensus.

Plots

A large range of plots are produced in --output-dir/plots. These include:

  1. denovo.pdf: If either de-novo mode was run, this plot reports the results. See the tag descriptions in the consensus finding and GFF3 tags sections above.
  2. completeness.pdf: The number of genes/transcripts successfully mapped over. The top of the x axis is marked with a red line representing the number of genes/transcripts in the source annotation.
  3. consensus_extrinsic_support: A violin plot of the level of extrinsic support seen across all species for splices and exons, as found by homGeneMapping. Provides an overall plot and a per-biotype plot.
  4. consensus_annotation_support: A violin plot of the level of annotation support seen across all species for splices and exons, as found by homGeneMapping. Provides an overall plot and a per-biotype plot.
  5. coverage.pdf: A violin plot of the overall transcript coverage in the consensus set. Provides an overall plot and a per-biotype plot.
  6. identity.pdf: A violin plot of the overall transcript identity in the consensus set. Provides an overall plot and a per-biotype plot.
  7. transmap_coverage.pdf: A violin plot of the overall transcript coverage in the filtered transMap output. Provides an overall plot and a per-biotype plot.
  8. transmap_identity.pdf: A violin plot of the overall transcript identity in the filtered transMap output. Provides an overall plot and a per-biotype plot.
  9. missing_genes_transcripts.pdf: Similar to completeness.pdf, this plot reports the number of genes and transcripts in the original annotation set not found on the target genomes.
  10. paralogy.pdf: Stacked bar charts of the number of alignments a given source transcript had in each target after globalNearBest filtering. This represents a best guess of true paralogy.
  11. unfiltered_paralogy.pdf: Stacked bar charts of the number of alignments a given source transcript had in each target. This represents all possible alignments for a given source transcript.
  12. split_genes.pdf: The number of transMap genes split within and between contigs.
  13. transcript_modes.pdf: The number of modes that supported a given comparative annotation. Applies only to protein-coding transcripts derived from transMap, because AugustusTMR is not run on non-coding inputs.
  14. augustus_improvement.pdf: A scatterplot + density plot reporting the improvement of primary consensus metrics when an augustus transcript was chosen over a transMap transcript. The density plot may fail in some cases.
  15. coding_indels.pdf: The rate of insertions, deletions and indels that are a multiple of 3 are reported from the final consensus set based on the pairwise alignments. Preference is given to the CDS space alignment, if it worked.
  16. IsoSeq_isoform_validation.pdf: The number of transcripts in the consensus set whose intron structure is exactly validated by at least one IsoSeq read.
  17. gene_family_collapse.pdf: The X-axis for each plot is the number of genes in the source transcript set that were collapsed into one locus, and the Y-axis is the number of times this occurred. So, for example, if X=1 and Y=200, there were 200 instances of 2 genes being collapsed into 1.

GFF3 tags:

  1. gene_id: Unique gene ID assigned to this gene.
  2. transcript_id: Unique transcript ID assigned to this transcript.
  3. alignment_id: Original alignment ID internally used in the pipeline. Provides a link to the gene sets input to consensus finding.
  4. alternative_source_transcripts: If deduplication collapsed transcripts, report the other source_transcript IDs.
  5. exon_annotation_support: Was this exon supported by the reference annotation?
  6. exon_rna_support: Was this exon supported by the extrinsic database?
  7. frameshift: Is this transcript frameshifted relative to source_transcript?
  8. gene_biotype: The source_gene biotype. If this is a de-novo prediction, this field will say unknown_likely_coding.
  9. intron_annotation_support: Was this intron supported by the reference annotation?
  10. intron_rna_support: Was this intron supported by the extrinsic database?
  11. source_gene: The gene ID of the source gene, if this is a projection transcript.
  12. source_gene_common_name: The common name of the source gene.
  13. source_transcript: The ID of the source transcript.
  14. transcript_biotype: The biotype of the source transcript, or unknown_likely_coding for de-novo predictions.
  15. transcript_class: For projection transcripts, just says ortholog. For de-novo transcripts, will be one of poor_alignment, possible_paralog, putative_novel_isoform, or putative_novel. See the consensus finding section for descriptions.
  16. transcript_modes: Comma separated list of transcript modes. The same information as the transcript_modes.pdf plot.
  17. pacbio_isoform_supported: Was this isoform supported by at least one IsoSeq read?
  18. paralogy: Comma separated list of alignments identified as possible paralogs for this transcript.
  19. possible_split_gene_locations: If this gene was split across multiple contigs, this will have a comma separated list of alternative locations.
  20. collapsed_gene_names: If this gene was a part of a gene family collapse, this field reports the common names of genes collapsed together here.
  21. collapsed_gene_ids: Same as above, but with unique identifiers.
  22. gene_alternate_loci: If this gene was identified to have paralogous mappings that were filtered out, these intervals are where the paralogs were found.

For GFF3 output, the alignment goodness is in the score field. For .gp_info, it is a column. For .gp_info, the support features are collapsed into comma separated vectors instead of being on their respective features.

comparative-annotation-toolkit's People

Contributors

ccoulombe, diekhans, esrice, ifiddes, ifiddes-10x-zz, joelarmstrong, laninsky, mhaukness-ucsc, nathanweeks, ph09


comparative-annotation-toolkit's Issues

Run HGM module only once

homGeneMapping is memory and CPU intensive, and cannot be split over many nodes. Right now, it is run separately on each gene set. This could be consolidated to save computational effort, but would require code to subsequently break up the results into their respective databases.

Use synteny score to make TMR smarter

If transMap produces multiple transcripts for a source transcript, and they are within some distance of each other on the same sequence, and they have a similar synteny score, then provide them simultaneously as hints to AugustusTM(R).

This could help fix transcripts fragmented by intronic rearrangements.

Consolidate and reorganize get_args() functions

Each module has its own get_args() function which could be reorganized into the PipelineTask base class. The downside of this is that class would be huge, but then all file paths would be generated in the same place.

-rnaNameAttr is not a valid option

I am trying to run CAT on the test data and get the following error:

05-08 09:03:07 luigi-interface INFO [pid 8910] Worker Worker(salt=007108439, workers=10, host=burrito, username=fbemm, pid=8298) running Task: Gff3ToGenePred
05-08 09:03:07 cat INFO Converting annotation gff3 to genePred.
-rnaNameAttr is not a valid option
05-08 09:03:07 luigi-interface ERROR [pid 8910] Worker Worker(salt=007108439, workers=10, host=burrito, username=fbemm, pid=8298) failed Task: Gff3ToGenePred
Traceback (most recent call last):
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 191, in run
new_deps = self._run_get_new_deps()
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 129, in _run_get_new_deps
task_gen = self.task.run()
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 676, in run
self.run_cmd(cmd)
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 357, in run_cmd
tools.procOps.run_proc(cmd, stdout=outf)
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/procOps.py", line 36, in run_proc
pl.wait()
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1127, in wait
self.raiseIfExcept()
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1085, in raiseIfExcept
p.raiseIfExcept()
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 749, in raiseIfExcept
raise self.exceptInfo[0], self.exceptInfo[1], self.exceptInfo[2]
ProcException: process exited 255: gff3ToGenePred -rnaNameAttr=transcript_id -geneNameAttr=gene_id -honorStartStopCodons /ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/test_data/GRCm38.mm10.subset.gff3 /dev/stdout

cranking up memory default in augustus_cgp.py

I am seeing the same error message as before (Job used more disk than requested. Cconsider modifying the user script to avoid the chance of failure due to incorrectly requested resources.)

As a quick fix, I am increasing the default mem to 8G where appropriate.

suggestion for improvement of manual: index bams

Just noticed that bam files need to be indexed; I think I saw this mentioned in an earlier version of the manual, but it seems to be gone.

Maybe good to mention that in the documentation (which is great btw!)

Faster GFF3 input parsing

The GFF3 parser takes almost 30 minutes to parse the Gencode GFF3 and uses a lot of memory. This is probably unnecessary because we don't need to properly represent the entire GFF3 - we only need to extract the relevant attributes table from it.

augustus_chunk completely fails but CAT keeps running

Hi Ian,

I wanted to run this by you. Basically, one of my jobs is still running at the Augustus_TMR stage, but it is not submitting toil jobs anymore and the entries in the log occurred quite a while ago. While I am expecting Augustus to be slow, I am wondering whether this is the expected behavior, especially as I discovered the error message pasted below.

Your insights, as always, are appreciated!

04-29 19:11:53 toil.leader WARNING The job seems to have left a log file, indicating failure: 'run_augustus_chunk' D/k/jobPs_7_y
---TOIL WORKER OUTPUT LOG---
D/k/jobPs_7_y    INFO:toil:Running Toil version 3.5.1-eb5032ad339a5114b97bd64c6dcd041a078884b5.
D/k/jobPs_7_y    WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
D/k/jobPs_7_y    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.augustus'
, fromVirtualEnv=False)
D/k/jobPs_7_y    Traceback (most recent call last):
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 336, in main
D/k/jobPs_7_y        with fileStore.open(job):
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/contextlib.py", line 17, in __enter__
D/k/jobPs_7_y        return self.gen.next()
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1602, in open
D/k/jobPs_7_y        self.findAndHandleDeadJobs(self.workFlowDir)
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1723, in findAndHandleDeadJobs
D/k/jobPs_7_y        for jobState in cls._getAllJobStates(nodeInfo):
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1776, in _getAllJobStates
D/k/jobPs_7_y        yield NonCachingFileStore._readJobState(filename)
D/k/jobPs_7_y      File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1780, in _readJobState
D/k/jobPs_7_y        with open(jobStateFileName) as fH:
D/k/jobPs_7_y    IOError: [Errno 2] No such file or directory: '/tmp/toil-c007bd19-5454-421f-89ce-387db5e3f144/tmp9zbqL2/1566926f-1200-4190-a413-6e9d773ae6ae/.jobState'
D/k/jobPs_7_y    ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a08201.rc.fas.harvard.edu
D/k/jobPs_7_y    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'run_augustus_chunk' D/k/jobPs_7_y with ID D/k/jobPs_7_y to 0
04-29 19:11:53 toil.leader WARNING Job 'run_augustus_chunk' D/k/jobPs_7_y with ID D/k/jobPs_7_y is completely failed

question: how to integrate external results from Augustus-ppx?

Hi Ian,

Some of gene families that are of particular interest to me are, unfortunately, not well annotated by Augustus if run in de novo or cgp mode. However, I can get relatively good predictions for those genes when running Augustus using the ppx mode and some custom protein profiles.

Would there be a quick and easy way to add the outputs from these analyses to your pipeline?

AugustusTMR runs out of memory sometimes.

The last AugustusTMR run took about 100 minutes per job on Rhesus, and 50 on Orangutan. Rhesus jobs sometimes ran out of the 8 GB of memory. This is due to the large amount of RNA-seq data for Rhesus in the database.

Ideas:

  1. Can Toil re-attempt a job with more memory? Find out.
  2. Attempt to auto-scale memory requirements based on number of BAMs in database.

Also, reduce the number of total Augustus runs per job to reduce the job runtime.

ete3 version not recognized

Hi,

I am having trouble installing the pipeline. In particular, it seems that setup.py is not satisfied with the version of ete3 I have.

pip install -e Comparative-Annotation-Toolkit
Obtaining file:///n/home01/lassance/Comparative-Annotation-Toolkit
  Running setup.py (path:/n/home01/lassance/Comparative-Annotation-Toolkit/setup.py) egg_info for package from file:///n/home01/lassance/Comparative-Annotation-Toolkit
    
    file cat.py (for module cat) not found
    file tools.py (for module tools) not found
Requirement already satisfied (use --upgrade to upgrade): pyfasta>=0.5.2 in ./.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): toil>=3.0 in ./.local/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): luigi>=2.5 in ./.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): seaborn>=0.7 in ./.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): pandas>=0.18 in ./.local/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): frozendict in ./.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): configobj>=5.0 in ./.local/lib/python2.7/site-packages (from cat==0.1)
Requirement already satisfied (use --upgrade to upgrade): sqlalchemy>=1.0 in ./.local/lib/python2.7/site-packages (from cat==0.1)
Downloading/unpacking ete3>=3.0 (from cat==0.1)
  Could not find a version that satisfies the requirement ete3>=3.0 (from cat==0.1) (from versions: 3.0.0b27, 3.0.0b32, 3.0.0b14, 3.0.0b26, 3.0.0b12, 3.0.0b13, 3.0.0b33, 3.0.0b31, 3.0.0b23, 3.0.0b8, 3.0.0b10, 3.0.0b11, 3.0.0b9, 3.0.0b7, 3.0.0b22, 3.0.0b34, 3.0.0b16, 3.0.0b29, 3.0.0b30, 3.0.0b15, 3.0.0b20, 3.0.0b24, 3.0.0b18, 3.0.0b35, 3.0.0b21, 3.0.0b17)
Cleaning up...
No distributions matching the version for ete3>=3.0 (from cat==0.1)
Storing debug log for failure in /n/home01/lassance/.pip/pip.log

However, I actually have a recent version installed

(ENV_PROGRESSIVECACTUS)[lassance@rcnx01 ~]$ conda list | grep ete3
ete3                      3.0.0b35                  <pip>
ete3_external_apps        2.0.4                    py27_0  

Any idea what is happening here?

Handle stable IDs.

Write the annotation set as a database after consensus finding. Two tables: 1) genePred structure, 2) gp_info table.

Take as input to the pipeline this database as an optional argument. If provided, attempt to use the Ensembl algorithm for maintaining stable ID mappings, where possible.

Problems:

  1. How to handle the case where the assembly itself has changed as well.

Plots not scheduled

I am trying to get the latest commit working on our data. There are several warnings and errors. I will open an issue for each of them.

luigi --module cat RunCat --hal=PRET_male-female.hal --ref-genome=PRET_female --workers=16 --config=PRET_male.config --work-dir=working --out-dir=results --augustus --augustus-cgp --assembly-hub --target-genomes='("PRET_male",)' --augustus-species=PRET

WARNING: Will not run Task: Plots or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 328, in check_complete
    is_complete = task.complete()
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/task.py", line 539, in complete
    outputs = flatten(self.output())
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 2027, in output
    args = Plots.get_args(pipeline_args)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 1986, in get_args
    include_ancestors=pipeline_args.annotate_ancestors)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/hal.py", line 36, in build_genome_order
    distances, ordered_names = zip(*ordered)
ValueError: need more than 0 values to unpack

INFO: Informed scheduler that task   Plots_False_True_True_dcaadf55e0   has status   UNKNOWN

CPG Training

CAT tries to call Augustus like this during training and dies

cmd = ['augustus',
'--species={}'.format(args.species),
'--treefile={}'.format(tree_path),
'--alnfile={}'.format(cat_maf),
'--dbaccess={}'.format(hints_db),
'--speciesfilenames={}'.format(genome_fofn),
'--refSpecies={}'.format(args.ref_genome),
'--referenceFile={}'.format(gtf),
'--param_outfile={}'.format(params)]

Problem is, that augustus "param_outfile" parameter does not exist according to "augustus --paramlist".

I guess it should be just outfile but I am not sure.

Any idea?

default disk and memory settings not propagated in Augustus jobs

Hi,

I am getting error messages pointing that the jobs used more resources than requested

04-20 16:30:45 toil.statsAndLogging WARNING Got message from job at time 04-20-2017 16:30:45: Job used more disk than requested. Cconsider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job J/H/jobLBFkRk/g/tmpJLhGTC.tmp used 280.16% (11.2 GB [12032901120B] used, 4.0 GB [4294967296B] requested) at the end of its run.
Exception in thread Thread-141:
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/batchSystems/slurm.py", line 179, in run
    activity |= self.checkOnJobs()
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/batchSystems/slurm.py", line 160, in checkOnJobs
    status = self.getJobExitCode(self.slurmJobIDs[jobID])
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/batchSystems/slurm.py", line 235, in getJobExitCode
    state, rc = self._getJobDetailsFromScontrol(slurmJobID)
TypeError: 'NoneType' object is not iterable

I tried to crank things up by adding the following options in the command, but they don't seem to have an effect:

--defaultDisk=13G --defaultMemory=13G  --disableCaching --Augustus-defaultDisk=13G --Augustus-defaultMemory=13G --augustus

Thanks for looking into that!

JML

Construct HashableNamespace Parameter

Replace usage of DictParameter to pass around parameters with your HashableNamespace implementation. That way is cleaner. You need to write a way to serialize HashableNamespaces to JSON and then decode them.

Filter AugustusTMR/CGP results for invalid GTF

There exists an edge case in Augustus where incomplete GTF records can be produced like this one for Rhesus:

$ grep jg21893.t1 Rhesus.augCGP.gtf
chr9 AugustusCGP transcript 91610925 91610927 . + . gene_id "jg21893"; transcript_id "jg21893.t1"; gene_name "jg21893";
chr9 AugustusCGP exon 91610925 91610927 . + . gene_id "jg21893"; transcript_id "jg21893.t1"; exon_number "1"; exon_id "jg21893.t1.1"; gene_name "jg21893";
chr9 AugustusCGP stop_codon 91610925 91610927 . + 0 gene_id "jg21893"; transcript_id "jg21893.t1"; exon_number "1"; exon_id "jg21893.t1.1"; gene_name "jg21893";

Cleaner logging.

Investigate using repr in individual tasks to clean up the log, not showing all of the args that are insignificant for understanding the log history.

Use RNAseq information when generating consensus gene set.

Bonus score should be given to transcripts who have supported introns. In order to do this, support will need to be extracted from the RNAseq hints db.

Possible way for this to work: weighted score = 0.3 * coverage + 0.5 * identity + 0.2 * (% supported exons)
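The proposal above, as a quick sketch (the 0.3/0.5/0.2 weights come straight from the issue text and are a suggestion, not a final formula):

```python
# Quick sketch of the weighting proposed above; the 0.3/0.5/0.2 weights come
# straight from the issue text and are a suggestion, not a final formula.

def weighted_score(coverage, identity, pct_supported_exons):
    return 0.3 * coverage + 0.5 * identity + 0.2 * pct_supported_exons

print(round(weighted_score(1.0, 0.9, 0.5), 2))  # 0.85
```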

speeding up handling large number of hints bam files

I am running the pipeline with 4 mammalian genomes, and have 87 bam files of RNA-Seq data for the target species.

The pipeline seems to be stuck in the GenerateHints and Chaining tasks (still not completed after >36h). I am using slurm as the batch system (specified workers=32 and maxCores=500, though I rarely see more than 20 toil jobs running at a given time).

I am wondering if this is due to the large number of inputs. Is there a way to speed up the process by, e.g., providing the hints.gff directly?

joingenes Segfault

Me again :)

augustus_pb and augustus_cgp call joingenes at some point, but it throws a segfault. I just tried the following:

joingenes -g galGal4.augTMR.gtf,galGal4.augTMR.gtf -o /dev/stdout

The first command gives a segfault, while

joingenes -g galGal4.augTM.gtf,galGal4.augTM.gtf -o /dev/stdout

does not.

I am not sure it is a file format issue. Supplying the gtfs as fofn does not change the picture.

Any ideas on this one?

F

Speed vs. Simplicity

The overall flow of this pipeline is

  1. Prepare reference files.
  2. Per-genome analyses.
  3. Combined plots.

There are two ways to conceptualize the pipeline: one allows for ease of modularity, and one is more complex but faster. The modular way requires that all of the genomes undergo one task group (GenomeFiles, Chaining, TransMap) before the next task group begins. The fast way would allow each genome to start the next step as soon as it is done.

The problem with the fast way is that it makes starting intermediate tasks more complex. It means you must specify the specific genomes you want to re-run through each task group. It is much cleaner to launch task groups as a block. You can still launch specific members of a task group through the command line interface by specifying the genome parameter.

At the current state, I am implementing the simpler option. Tasks are grouped by wrapperTasks which represent major sections of the workflow. These can all be launched through the same command line options as the main RunCat wrapper task (which wraps each task group).

Fix score field for genePred output

Score needs to be an integer. Scale the consensus score to be an integer in the range 0-1000. Do the same for the GFF3 version for consistency.

Handle GTF input

This will probably require a secondary input file with the attributes table.

--target-genomes cause error (bug?)

I am reaching out because while the test example worked fine, I am running into some problems when trying to get the pipeline to run with my own data.

my command:

luigi --module cat RunCat --hal=/n/hoekstrafs4/lassance/ProgressiveCactus/ALN_MOUSE_RAT_VOLE_DEERMOUSE/output.hal --ref-genome=MOUSE --config=run1.config --work-dir /n/regal/hoekstra_lab/lassance/CAT/run1 --out-dir /n/regal/hoekstra_lab/lassance/CAT/run1 --target-genomes=DEERMOUSE --batchSystem Slurm --workers=4 --maxCores=500 --RunCat-disableCaching --augustus --augustus-cgp --RunCat-resolve-split-genes

and the error:

ERROR: Uncaught exception in luigi
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/retcodes.py", line 74, in run_with_retcodes
    worker = luigi.interface._run(argv)['worker']
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/interface.py", line 237, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/cmdline_parser.py", line 116, in get_task_obj
    return self._get_task_cls()(**self._get_task_kwargs())
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
    res.update(((param_name, param_obj.parse(attr)),))
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/parameter.py", line 988, in parse
    return literal_eval(x)  # if this causes an error, let that error be raised.
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/ast.py", line 80, in literal_eval
    return _convert(node_or_string)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/ast.py", line 79, in _convert
    raise ValueError('malformed string')
ValueError: malformed string

I am wondering whether this has to do with the fact that the paths to the different files are rather long (plus they are on different filesystems).
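The traceback shows luigi handing the parameter value to `ast.literal_eval`, which only accepts Python literals, so a bare genome name fails regardless of path length. A minimal reproduction (the quoted-tuple flag syntax, e.g. `--target-genomes='("DEERMOUSE",)'`, is an untested suggestion):

```python
import ast

# luigi parses tuple-typed parameters with ast.literal_eval, so the value
# must be a valid Python literal; a bare genome name is not one.
ast.literal_eval('("DEERMOUSE",)')   # parses cleanly to the tuple ('DEERMOUSE',)

try:
    ast.literal_eval('DEERMOUSE')    # what --target-genomes=DEERMOUSE hands over
except ValueError:
    print('malformed string')        # the error seen in the traceback above
```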

Thanks

Design mixture model to learn mismatch rate vs alignment/assembly error rate

In order to make rate estimations robust, and allow for a way to discriminate alignment quality, design a mixture model.

Right now, I think the easiest way may be to model it as two lognormal distributions. Partly this is for simplicity - pomegranate does not currently support gamma distributions in mixture models.
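Since a mixture of lognormals is just a Gaussian mixture in log space, one possible sketch (using scikit-learn rather than pomegranate, on simulated rates; all parameters here are illustrative) is:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated per-transcript mismatch rates: a low-rate "real divergence"
# component and a high-rate "alignment/assembly error" component.
rates = np.concatenate([
    rng.lognormal(mean=-4.0, sigma=0.3, size=500),
    rng.lognormal(mean=-1.5, sigma=0.4, size=100),
])

# A two-component lognormal mixture is a two-component Gaussian mixture
# on log(rate).
log_rates = np.log(rates).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_rates)

# Posterior probability that each transcript belongs to the error component
# (the component with the higher mean rate).
error_component = int(np.argmax(gmm.means_))
p_error = gmm.predict_proba(log_rates)[:, error_component]
```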

gff3 output not suitable for conversion in json format

I would like to visualize the gff3 outputs using the Apollo Genome Viewer and attempted to convert the files using the flatfile-to-json.pl script from JBrowse.

Unfortunately, this does not work properly as the script detects some inconsistencies:

GFF3 parse error: some features reference other features that do not exist in the file 
(or in the same '###' scope).  A list of them:
ID                 |           Cannot Find
----------------------------------------------------------------------
ENSMUST00000200561. | Parent=ENSMUSG00000105709.1

Indeed it turns out the feature Parent is not defined as expected:

chr1	CAT	gene	169477789	169491203	.	-	.	ID=DEERMOUSE_G0004594;Name=Gm42723;failed_gene=False;gene_biotype=antisense;source_gene=ENSMUSG00000105709.1;source_gene_common_name=Gm42723;transcript_modes=transMap
chr1	CAT	transcript	169477789	169491203	7450	-	.	ID=ENSMUST00000200561.1-1;Name=Gm42723;Parent=ENSMUSG00000105709.1;alignment_id=ENSMUST00000200561.1-1;failed_gene=False;frameshift=nan;gene_biotype=antisense;gene_id=DEERMOUSE_G0004594;source_gene=ENSMUSG00000105709.1;source_gene_common_name=Gm42723;source_transcript=ENSMUST00000200561.1;source_transcript_name=Gm42723-001;transcript_biotype=antisense;transcript_class=passing;transcript_id=DEERMOUSE_T0012117;transcript_modes=transMap

In this example, the parent of ENSMUST00000200561.1 should be 'DEERMOUSE_G0004594', not 'ENSMUSG00000105709.1', shouldn't it?

Note that I did not run the pipeline with the --assembly-hub flag (not sure that it would generate other files compatible with the genome browser).
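As a stopgap while this is open, the bad Parent tags could be rewritten by mapping each gene line's source_gene back to its CAT ID; a rough sketch based only on the records shown above:

```python
def fix_parents(lines):
    """Rewrite transcript Parent= tags that point at the source_gene
    (e.g. ENSMUSG...) so they point at the ID on the CAT gene line."""
    source_to_id = {}
    fixed = []
    for line in lines:
        if line.startswith('#') or not line.strip():
            fixed.append(line)
            continue
        fields = line.rstrip('\n').split('\t')
        attrs = dict(kv.split('=', 1) for kv in fields[8].split(';') if '=' in kv)
        if fields[2] == 'gene' and 'source_gene' in attrs:
            # remember which CAT gene ID this source gene maps to
            source_to_id[attrs['source_gene']] = attrs['ID']
        elif attrs.get('Parent') in source_to_id:
            attrs['Parent'] = source_to_id[attrs['Parent']]
            fields[8] = ';'.join(f'{k}={v}' for k, v in attrs.items())
        fixed.append('\t'.join(fields))
    return fixed
```

This assumes gene lines always precede their transcripts, which GFF3's `###`-scoping convention would also require.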

As always, your help/suggestion is greatly appreciated!

Pipeline crashes without extrinsic support

If no extrinsic support is provided, the pipeline will fail on consensus finding because homGeneMapping did not find anything extrinsic.

attrs['rna_support'] = find_feature_support(attrs, 'exon_rna_support', i)
  File "/hive/users/ifiddes/Comparative-Annotation-Toolkit/cat/consensus.py", line 857, in find_feature_support
    vals = map(bool, attrs[feature].split(','))
KeyError: 'exon_rna_support'

Probably the easiest way to fix this is to change these dict queries to .get(feature, 'n/a').
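A sketch of that guard, on a hypothetical version of find_feature_support (the real signature and return convention may differ; here a missing key is treated as no support rather than returning 'n/a'):

```python
def find_feature_support(attrs, feature, i):
    """Return whether feature i has extrinsic support, tolerating runs
    without any extrinsic hints."""
    raw = attrs.get(feature)      # .get() instead of attrs[feature]
    if raw is None:               # homGeneMapping found nothing extrinsic
        return False
    vals = [bool(int(v)) for v in raw.split(',')]
    return vals[i]
```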

Handle different methods of gff3, GTF

Ensembl produces a difficult-to-parse gff3 on their FTP site.

Here is an example from Ensembl:

6   ensembl_havana  gene    125161715   125166467   .   -   .   ID=gene:ENSMUSG00000057666;Name=Gapdh;biotype=protein_coding;description=glyceraldehyde-3-phosphate dehydrogenase [Source:MGI Symbol%3BAcc:MGI:95640];gene_id=ENSMUSG00000057666;havana_gene=OTTMUSG00000027118;havana_version=9;logic_name=ensembl_havana_gene;version=18
6   ensembl_havana  transcript  125161853   125166467   .   -   .   ID=transcript:ENSMUST00000117757;Parent=gene:ENSMUSG00000057666;Name=Gapdh-003;biotype=protein_coding;ccdsid=CCDS80613.1;havana_transcript=OTTMUST00000067086;havana_version=3;tag=basic;transcript_id=ENSMUST00000117757;transcript_support_level=1 (assigned to previous version 7);version=8
6   havana  exon    125166394   125166467   .   -   .   Parent=transcript:ENSMUST00000117757;Name=ENSMUSE00001229467;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSMUSE00001229467;rank=1;version=1

And an example from Gencode:

chr6    HAVANA  gene    125161715   125166467   .   -   .   ID=ENSMUSG00000057666.18;gene_id=ENSMUSG00000057666.18;gene_type=protein_coding;gene_status=KNOWN;gene_name=Gapdh;level=2;havana_gene=OTTMUSG00000027118.9
chr6    HAVANA  transcript  125161853   125166467   .   -   .   ID=ENSMUST00000117757.8;Parent=ENSMUSG00000057666.18;gene_id=ENSMUSG00000057666.18;transcript_id=ENSMUST00000117757.8;gene_type=protein_coding;gene_status=KNOWN;gene_name=Gapdh;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=Gapdh-003;level=2;protein_id=ENSMUSP00000113942.2;transcript_support_level=1;tag=basic,CCDS;ccdsid=CCDS80613.1;havana_gene=OTTMUSG00000027118.9;havana_transcript=OTTMUST00000067086.3
chr6    HAVANA  exon    125166394   125166467   .   -   .   ID=exon:ENSMUST00000117757.8:1;Parent=ENSMUST00000117757.8;gene_id=ENSMUSG00000057666.18;transcript_id=ENSMUST00000117757.8;gene_type=protein_coding;gene_status=KNOWN;gene_name=Gapdh;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=Gapdh-003;exon_number=1;exon_id=ENSMUSE00001229467.1;level=2;protein_id=ENSMUSP00000113942.2;transcript_support_level=1;ccdsid=CCDS80613.1;havana_gene=OTTMUSG00000027118.9;havana_transcript=OTTMUST00000067086.3;tag=basic,CCDS

Gencode is not producing entirely proper gff3: they do not tag the first-level ID for gene or transcript objects. This is actually easier for me, because it means I don't have to munge the output of gff3ToGenePred. I don't see an easy way to handle this seamlessly without some sort of user input. I am tempted to stick with the GTF format instead of gff3 for input, because it is a more consistent format. My thought at this point is to only handle GTF input, and then provide the option for gff3 output (with my annotations). This would mean we cannot transfer important gff3 annotations like selenocysteine, however.
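For what it's worth, the two styles above could be normalized by stripping the type prefixes that Ensembl adds to ID and Parent; a rough sketch (the prefix list is an assumption):

```python
import re

def parse_attrs(attr_col):
    """Parse a GFF3 attribute column into a dict, stripping Ensembl's
    'gene:'/'transcript:'-style prefixes from ID and Parent so Ensembl
    and Gencode records yield comparable identifiers."""
    attrs = {}
    for kv in attr_col.rstrip(';').split(';'):
        key, _, value = kv.partition('=')
        if key in ('ID', 'Parent'):
            value = re.sub(r'^(gene|transcript|exon|CDS):', '', value)
        attrs[key] = value
    return attrs
```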

Use fasttree to generate gene trees for all paralogous alignments

This can help resolve the correct ortholog, and probably do better than the synteny score. May also want to consider some sort of consensus of synteny score and tree distance (if more than one sequence is very close to the other sequences, resolve based on synteny).

provide possibility to increase tmp disk space for toil jobs

I am seeing a lot of error messages like the one pasted below. It seems that the 2G disk limit set as default by toil is not sufficient.

02-04 12:51:42 toil.statsAndLogging WARNING Got message from job at time 02-04-2017 12:51:42: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job A/t/job3ps345/g/tmp7AEQNI.tmp used 169.86% (3.4 GB [3647610880B] used, 2.0 GB [2147483648B] requested) at the end of its run.

AFAIK these jobs are not killed by the scheduler (Slurm in the present case), but it does not seem like a healthy strategy (I have seen error messages reporting usage at 700% of the limit), so having the possibility to modify --defaultDisk would be a good thing.

error with gff3 file from NCBI

I am attempting to use as reference a gff3 downloaded from NCBI (I ran the convert script before starting the pipeline).

The pipeline stops with the following error message:

ERROR: [pid 36827] Worker Worker(salt=872295220, workers=16, host=holyhoekstra02.rc.fas.harvard.edu, username=lassance, pid=35993) failed Task: Gff3ToAttrs
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/worker.py", line 192, in run
    new_deps = self._run_get_new_deps()
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
    task_gen = self.task.run()
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/__init__.py", line 710, in run
    results = tools.gff3.extract_attrs(pipeline_args.annotation)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/tools/gff3.py", line 372, in extract_attrs
    df.StartCodon = pd.to_numeric(df.StartCodon)  # force the types here for better sql dtypes
  File "/n/home01/lassance/.local/lib/python2.7/site-packages/pandas/core/generic.py", line 2744, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'StartCodon'

I guess this means that a field or attribute is missing from the gff3 coming out of NCBI, right?

Parse homGeneMapping into actual intervals

Storing a bit vector is both harder to reason about later and less useful for manual inspection. The homGeneMapping output should be stored as a flat database table with the columns feature_type, chrom, start, stop, value. @diekhans
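For illustration, the proposed flat table might look like this in SQLite (everything beyond the listed column names is an assumption):

```python
import sqlite3

# Sketch of the proposed flat table for homGeneMapping intervals;
# the column set follows the description above.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE hgm_support (
        feature_type TEXT,    -- e.g. exon_rna_support, intron_annot_support
        chrom        TEXT,
        start        INTEGER,
        stop         INTEGER,
        value        INTEGER  -- e.g. number of genomes supporting the interval
    )''')
conn.execute("INSERT INTO hgm_support VALUES ('exon_rna_support', 'chr1', 100, 200, 3)")
rows = conn.execute('SELECT * FROM hgm_support').fetchall()
```

Actual intervals stored this way can then be queried or inspected directly, rather than decoded from a bit vector.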

The termination flag is set, exiting with test_data

Hey Ian,

I finally got the new Augustus installed, and the --paramlist now fits. Running the test data, I now get the following issues.

07-05 15:56:40 cat.hints_db INFO canFam3 has 0 valid intron-only BAMs, 0 valid BAMs and 0 valid IsoSeq BAMs. Beginning Toil hints pipeline.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/fileStore.py", line 1468, in asyncWrite
raise RuntimeError("The termination flag is set, exiting")
RuntimeError: The termination flag is set, exiting

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/fileStore.py", line 1468, in asyncWrite
raise RuntimeError("The termination flag is set, exiting")
RuntimeError: The termination flag is set, exiting

Exception RuntimeError: RuntimeError('cannot join current thread',) in <bound method CachingFileStore.del of <toil.fileStore.CachingFileStore object at 0x7f5336d35c10>> ignored
07-05 15:56:45 toil.leader WARNING The job seems to have left a log file, indicating failure: 'build_exon_hints' Y/E/job3T1k2E
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E ---TOIL WORKER OUTPUT LOG---
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E INFO:toil:Running Toil version 3.8.0-4c83830e4f42594d995e01ccc07b47396b88c9e7.
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E INFO:toil.fileStore:Starting job ('build_exon_hints' Y/E/job3T1k2E) with ID (d3326af0e2cc5d867826d84a845564007890e304).
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E Traceback (most recent call last):
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1289, in _runner
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E returnValues = self._run(jobGraph, fileStore)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1234, in _run
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E return self.run(fileStore)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1406, in run
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/hints_db.py", line 284, in build_exon_hints
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E tools.procOps.run_proc(cmd, stdout=exon_gff_path)
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/procOps.py", line 36, in run_proc
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E pl.wait()
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1119, in wait
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E self.start()
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1076, in start
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E self.__execBarrier()
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1031, in __execBarrier
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E p._execWait()
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 724, in _execWait
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E raise ex
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E ProcException: exec failed: bam2wig /tmp/toil-6390cf36-f34f-4dc1-8d30-3f8c586ae871/tmp2Z4IVp/451b4634-c5ff-4d8c-ac41-84c1a13797d7/tmpqi_v3M.tmp,
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E caused by: OSError: [Errno 2] No such file or directory
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E ERROR:toil.worker:Exiting the worker because of a failed job on host burrito
07-05 15:56:45 toil.leader WARNING Y/E/job3T1k2E WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'build_exon_hints' Y/E/job3T1k2E with ID Y/E/job3T1k2E to 0
07-05 15:56:45 toil.leader WARNING Job 'build_exon_hints' Y/E/job3T1k2E with ID Y/E/job3T1k2E is completely failed
07-05 15:56:48 cat INFO Finished GenerateHints Toil pipeline for canFam3.
07-05 15:56:48 luigi-interface INFO [pid 40675] Worker Worker(salt=327313050, workers=10, host=burrito, username=fbemm, pid=35395) done ToilTask: GenerateHints for canFam3 using batchSystem singleMachine
07-05 15:56:48 luigi-interface INFO Informed scheduler that task GenerateHints_False_True_True_a0f8dfd546 has status DONE
Can you make anything of it?

Thanks a lot,
F

CGP overlap is slow!

Mostly the result of using bedtools to calculate the Jaccard metric, which means average CPU utilization stays around 50%.

Ways to fix:

  1. Split it into chunks of toil jobs.
  2. Implement own Jaccard tester.
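A pure-Python Jaccard over sorted interval lists would avoid the bedtools subprocess overhead; a minimal sketch of option 2 (not CAT's actual implementation):

```python
def interval_jaccard(a, b):
    """Jaccard similarity of two interval sets given as sorted,
    non-overlapping (start, end) lists on the same chromosome."""
    def total(ivs):
        return sum(end - start for start, end in ivs)

    # sweep both lists once, accumulating the intersection length
    inter = 0
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            inter += end - start
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    union = total(a) + total(b) - inter
    return inter / union if union else 0.0
```

This runs in O(n + m) per chromosome and stays inside the Python process, so CPU utilization is no longer limited by piping to an external tool.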
