Hi Ian, Some of gene families that are of particular interest to me

Question: how to integrate external results from Augustus-ppx? (ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit)

Comments (38)

ifiddes commented on September 16, 2024

Unfortunately, I was short-sighted when designing this pipeline and did not allow for this kind of flexibility. The same is true of providing external hints. I want to re-engineer the whole thing eventually. If I find time soon, I think I can make this possible without too much work, and without the hack below.

How to hack it in: pretend your PPX output is either CGP or PB. Place the GTF you produced in the --work-dir/augustus_cgp or --work-dir/augustus_pb directory with the correct names ($genome.augCGP.gtf). Construct a matching genePred with gtfToGenePred -genePredExt $genome.augCGP.gtf $genome.augCGP.gp. Then restart the pipeline with --augustus-cgp or --augustus-pb set, and it should proceed with parent gene assignment, homGeneMapping and consensus gene set building.
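
The file placement described above can be sketched as follows. This is a minimal sketch, not part of CAT: the helper name `stage_external_predictions` and its arguments are hypothetical, and only the directory names, the `$genome.augCGP.gtf` naming pattern and the `gtfToGenePred -genePredExt` call come from the comment. By default it only copies the GTF and returns the command, so the call can be inspected before running it.

```python
import os
import shutil
import subprocess


def stage_external_predictions(gtf, work_dir, genome, mode="augCGP", run=False):
    """Copy an external GTF into the CAT work dir and build the gtfToGenePred call.

    mode is "augCGP" or "augPB"; directory names follow the comment above.
    """
    subdir = {"augCGP": "augustus_cgp", "augPB": "augustus_pb"}[mode]
    out_dir = os.path.join(work_dir, subdir)
    os.makedirs(out_dir, exist_ok=True)
    dest_gtf = os.path.join(out_dir, "{}.{}.gtf".format(genome, mode))
    dest_gp = os.path.join(out_dir, "{}.{}.gp".format(genome, mode))
    shutil.copyfile(gtf, dest_gtf)
    cmd = ["gtfToGenePred", "-genePredExt", dest_gtf, dest_gp]
    if run:
        # Requires the Kent tool gtfToGenePred to be on PATH.
        subprocess.check_call(cmd)
    return cmd
```

Calling it with `run=True` would also produce the matching genePred, after which the pipeline can be restarted with `--augustus-cgp` or `--augustus-pb`.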

from comparative-annotation-toolkit.

ifiddes commented on September 16, 2024

One other question -- did you ever train your CGP model? CGP relies heavily on being trained for the current alignment. I am still working on automating the process, but there is a guide in more recent versions of the augustus repo. There is a new graduate student in Mario's lab called Lizzie who is working on this. I can hook you two up via email. I haven't heard from them in a few weeks, but they promised me a much more straightforward training approach so that I can integrate it into the pipeline.

lassancejm commented on September 16, 2024

No, I ended up not retraining Augustus. Partly because I don't have a reference gene set, and partly because the human param seems to work pretty well in general. If there is a better solution, I am happy to try it out.

About the ppx thing, thanks, I will probably try your hack sooner rather than later (i.e. feeding the output as augustus_pb output).

lassancejm commented on September 16, 2024

After pulling the latest version of master (071117), hoping to use the new self-training module with Augustus-CGP, I finally tried feeding in an external annotation following your advice. Basically, I converted the gff3 into .gtf and .gp, added both to the --work-dir/augustus_pb directory and restarted CAT with the --augustus-pb option.

The pipeline failed, complaining that there were no data to run augustus-pb. So, I aligned the cDNA corresponding to the external annotation using gmap and added the resulting bam under [ISO_SEQ_BAM] in the config file.

Now I get the error message pasted below. Apparently, still no PB hints are found ...

---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.5.2-378fffa320ded1ed1ebade5ec7d01138699db3f6.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.augustus_pb', fromVirtualEnv=False)
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.augustus_pb', fromVirtualEnv=False)
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1270, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1217, in _run
    return self.run(fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1383, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/augustus_pb.py", line 64, in setup
    raise RuntimeError('No PB hints found.')
RuntimeError: No PB hints found.
ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a08207.rc.fas.harvard.edu
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'setup' n/U/jobwaP8zf with ID n/U/jobwaP8zf to 0

Any idea of what is happening here?

ifiddes commented on September 16, 2024

I didn't think through this hack entirely -- as you encountered, the pipeline checks for an ISO_SEQ_BAM before allowing the PB module to run.

But, what you did should have worked -- did you delete the --work-dir/hints_database directory? If you go into that folder and grep src=PB do you get hints?

The pipeline is supposed to re-run key steps like that if the config file has a different hash than the previous run, maybe that process broke somehow. I will look into it. In the meantime, if there are no src=PB hints in your hints GFF files, then you need to have it re-do the hints building step by removing that folder. That grep step is effectively what the pipeline is doing at this step, and finding nothing, and so raising an exception.
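
The `grep src=PB` check can also be done from Python. This is a rough sketch under assumptions: the helper name `count_pb_hints` is hypothetical, and the `*.gff*` glob and the plain-text (uncompressed) format are guesses about what the hints folder contains, so adjust both to the files actually present.

```python
import glob
import os


def count_pb_hints(hints_dir, pattern="*.gff*"):
    """Return {path: number of lines containing 'src=PB'} for hint GFF files."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(hints_dir, pattern))):
        with open(path) as fh:
            counts[path] = sum(1 for line in fh if "src=PB" in line)
    return counts
```

If every count is zero, the hints were built before the BAM was added to the config, and the hints folder needs to be removed so that the step is redone.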

lassancejm commented on September 16, 2024

Mmmh, it makes sense. I have deleted the hints_database and restarted the pipeline. Will post an update as soon as I can. Thanks again!

ifiddes commented on September 16, 2024

Hi,

Just wanted to let you know that with the latest commit (#a69c959) I have introduced the ability to provide a protein FASTA and have hints be automatically generated by performing BLAT alignments of this file against the genomes listed in the config.

I will next provide a way to directly feed in your own extra GFF, if desired. I have to think a bit about how to perform this, because augustus config files need to be tuned for the types of hints being provided, and allowing open-ended hints could lead to problems.

lassancejm commented on September 16, 2024

Thanks for the info, it sounds extremely useful.

About the original issue, the --augustus-pb run completed successfully. I then fed CAT my own manually curated .gp and .gtf and relaunched so that these files were used for homGeneMapping.

Also, as you suggested, I am running augCGP with --cgp-train-num-exons 10000 to train Augustus (I have also upgraded to version 3.3.0 of Augustus). This has been running for a while; how long do you expect this step to take?

ifiddes commented on September 16, 2024

Are you on a version of CAT >= #8f0a85d? That commit is when I integrated parallelization of the training. This greatly reduced the runtime, but it still may take a while. In my current 25-genome run it took about 12 hours to convert the HAL to MAF chunks and then another 6 hours to train. The same alignment, before parallelization, took 11 days to train.

lassancejm commented on September 16, 2024

Ah, that is it, I am using the previous commit. Good to know that you already fixed this (you're the best!). Will update accordingly then. Thanks!

BTW, is --cgp-train-num-exons 10000 a good setting?

ifiddes commented on September 16, 2024

Unfortunately, like before, you'll need to blow away the toil dir if you choose to restart, and rerun the MAF extraction step. I have plans to make them discrete modules in the future. You may also need to make sure your Augustus is above revision 1303 in their repository. I am not sure what release that equates to, or if it is an actual release yet.

lassancejm commented on September 16, 2024

Hmm, I could not figure out which revision version 3.3.0 corresponds to, but I was encouraged by the description of that release, as compatibility with CAT is explicitly mentioned:

List of changes from version 3.2.3 to 3.3 (until July 11th, 2017)
     - new program ESPOCA to estimate selective pressure on codon alignments
     - gene finding on ancestral genomes is enabled
     - new default parameters for comparative gene prediction (CGP)
     - clade parameters training for CGP
     - compatibility to Ian Fiddes' Comparative Annotation Toolkit (CAT)
     - new scripts eval_dualdecomp.pl,
     - more tolerant tree parsing
     - bugfixes in augustus, joingenes, load2sqlitedb, transMap2hints.pl, splitMfasta.pl, intron2exex.pl,
       aln2wig
     - new functionality in homGeneMapping, joingenes

ifiddes commented on September 16, 2024

Did you ever get CGP to work? I have (I think) finally gotten the protein based evidence portion to work, which should help CGP prediction in species without RNA-seq or highly divergent species.

lassancejm commented on September 16, 2024

Thanks for the info. I will give it a try.

lassancejm commented on September 16, 2024

Hi Ian,

I have a job running now with protein data.

Things seem to not be working properly:

---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.5.2-378fffa320ded1ed1ebade5ec7d01138699db3f6.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1270, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1217, in _run
    return self.run(fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1383, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/hints_db.py", line 309, in run_protein_blat
    return job.fileStore.writeGlobalFile(tmp_psl)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1646, in writeGlobalFile
    fileStoreID = self.jobStore.writeFile(absLocalFileName, cleanupID)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 212, in writeFile
    shutil.copyfile(localFilePath, absPath)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: '/scratch/tmp/toil-2dda024a-60b3-4486-9070-bb8aeead8cca/tmpSNEzeR/6db70d1e-d16d-4e04-83a2-6c48fd00c09f/holy2a08107.rc.fas.harvard.edu.20716.9229939338.tmp'
ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a08107.rc.fas.harvard.edu
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'run_protein_blat' A/6/jobbLlQWK with ID A/6/jobbLlQWK to 0
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'run_protein_blat' A/6/jobbLlQWK to 13958643712 bytes

Thanks for your help troubleshooting this!

ifiddes commented on September 16, 2024

Added a commit on this. The issue here is that BLAT sometimes produces invalid protein-genome alignments, which we filter out with pslCheck. This is a hack, but it's way faster than trying to use exonerate.

In your case, it seems that every single alignment failed for that specific input chunk, and so the output file never got created. This should bypass that now, but I don't have a test set for this case.
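
The failure mode described here (a filter step that writes no file at all when every record is rejected) can be guarded against in a generic way. The sketch below is not the actual CAT commit; `ensure_output_exists` is a hypothetical helper that just shows the idea of guaranteeing that a file, possibly empty, exists before it is handed to the next step.

```python
import os


def ensure_output_exists(path):
    """Create an empty file at path if a previous step produced nothing.

    Returns True if the file already existed, False if it had to be created.
    """
    if os.path.exists(path):
        return True
    # Opening in append mode creates the file without truncating anything.
    with open(path, "a"):
        pass
    return False
```

Downstream code then has to tolerate an empty alignment file, which is usually simpler than handling a missing one.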

lassancejm commented on September 16, 2024

Good. Your patch seems to work for that error.

Now I start seeing another set of error messages (related to a bam sorting step):

---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.5.2-378fffa320ded1ed1ebade5ec7d01138699db3f6.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
WARNING:toil.fileStore:Starting job i/c/jobFOCViW/g/tmpVg8kib.tmp with less than 10% of disk space remaining.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1270, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1217, in _run
    return self.run(fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1383, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/hints_db.py", line 155, in namesort_bam
    bam_path = job.fileStore.readGlobalFile(bam_file_id)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/fileStore.py", line 1658, in readGlobalFile
    self.jobStore.readFile(fileStoreID, localFilePath)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 251, in readFile
    shutil.copyfile(jobStoreFilePath, localFilePath)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/shutil.py", line 84, in copyfile
    copyfileobj(fsrc, fdst)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/shutil.py", line 52, in copyfileobj
    fdst.write(buf)
IOError: [Errno 28] No space left on device
ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a02206.rc.fas.harvard.edu
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'namesort_bam' N/u/jobNEbiKh with ID N/u/jobNEbiKh to 0

---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.5.2-378fffa320ded1ed1ebade5ec7d01138699db3f6.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
WARNING:toil.fileStore:Starting job t/5/jobQu96hO/g/tmp3xDeNW.tmp with less than 10% of disk space remaining.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
[E::bgzf_flush] hwrite error (wrong size)
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1270, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1217, in _run
    return self.run(fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1383, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/hints_db.py", line 181, in namesort_bam
    file_id = write_bam(r, ns_handle)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/hints_db.py", line 151, in write_bam
    outf_h.write(rec)
  File "pysam/libcalignmentfile.pyx", line 1334, in pysam.libcalignmentfile.AlignmentFile.write (pysam/libcalignmentfile.c:15439)
  File "pysam/libcalignmentfile.pyx", line 1363, in pysam.libcalignmentfile.AlignmentFile.write (pysam/libcalignmentfile.c:15367)
IOError: sam_write1 failed with error code -1
ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a04106.rc.fas.harvard.edu
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'namesort_bam' r/c/jobNvseen with ID r/c/jobNvseen to 0

---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.5.2-378fffa320ded1ed1ebade5ec7d01138699db3f6.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
WARNING:toil.fileStore:Starting job t/5/jobQu96hO/g/tmpNYLQ55.tmp with less than 10% of disk space remaining.
WARNING:toil.resource:Can't find resource for leader path '/n/home01/lassance/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/home01/lassance/Comparative-Annotation-Toolkit', name='cat.hints_db', fromVirtualEnv=False)
sambamba-sort: Unable to write to stream
Traceback (most recent call last):
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1270, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1217, in _run
    return self.run(fileStore)
  File "/n/home01/lassance/.conda/envs/ENV_PROGRESSIVECACTUS/lib/python2.7/site-packages/toil/job.py", line 1383, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/cat/hints_db.py", line 161, in namesort_bam
    tools.procOps.run_proc(cmd, stdout=name_sorted)
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/tools/procOps.py", line 36, in run_proc
    pl.wait()
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1127, in wait
    self.raiseIfExcept()
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1085, in raiseIfExcept
    p.raiseIfExcept()
  File "/n/home01/lassance/Comparative-Annotation-Toolkit/tools/pipeline.py", line 749, in raiseIfExcept
    raise self.exceptInfo[0], self.exceptInfo[1], self.exceptInfo[2]
ProcException: process exited 1: sambamba sort -t 4 -m 15G -o /dev/stdout -n /dev/stdin
ERROR:toil.worker:Exiting the worker because of a failed job on host holy2a02206.rc.fas.harvard.edu
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'namesort_bam' 8/b/job37GqEO with ID 8/b/job37GqEO to 0

I think I captured the different types of error messages I see. My intuition is that it has to do with some parametrization. What do you think?

Thanks!

ifiddes commented on September 16, 2024

Hmm. The first error is easy -- the location of your $TMPDIR is out of space. Toil automatically places all of its work in that location unless you specify the --workDir flag. If you do specify that flag, that location needs to be accessible by all nodes on the cluster (and preferably something fast, i.e. not NFS). The other errors are more vague, but my guess is that they are both symptoms of the same problem -- no space to write.

If clearing enough temp space is not an option, does your file system setup allow for using the --workDir flag? I personally always use it, because I also have issues with other people filling the tempdir on cluster nodes. If you set --workDir to a shared filesystem, you should probably also set --disableCaching to avoid needless file copying.
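
Before launching, the free space at the temp location can be checked on a node with a few lines of Python. This is a sketch, assuming Python 3 (`shutil.disk_usage` is not available in the Python 2.7 environment shown in the logs); the function names and the 20 GiB default threshold are arbitrary illustrations, not anything Toil or CAT defines.

```python
import shutil
import tempfile


def free_gib(path=None):
    """Return free space in GiB for path (default: the system temp dir)."""
    if path is None:
        path = tempfile.gettempdir()  # honours $TMPDIR when it is set
    return shutil.disk_usage(path).free / float(1024 ** 3)


def has_room(path=None, needed_gib=20.0):
    """True if at least needed_gib GiB are free at path."""
    return free_gib(path) >= needed_gib
```

The same check can be pointed at the directory intended for `--workDir` by passing it as `path`. It only reports the space available at that moment, so another user can still fill the disk while a job is running.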

lassancejm commented on September 16, 2024

OK, it seems I was a bit quick to cry for help here, sorry about that. I restarted the pipeline and the failed jobs completed successfully. I may try --workDir if this re-occurs. By default I specify that toil jobs should land on nodes that have at least 20G of temporary disk space available, which of course doesn't prevent someone else from causing trouble.

lassancejm commented on September 16, 2024

Reviving this thread, although I am not sure whether the misbehavior I observe has to do with augustus-pb per se.
CAT finishes fine, but I see abnormally long gene prediction(s).

chr1	CAT	gene	483503	193059942	.	-	.	ID=BEAST_G0000001;Name=None;gene_biotype=unknown_likely_coding;source_gene=None;source_gene_common_name=None;transcript_modes=augPB,augCGP

This 'thing' contains 2073 transcripts...
Is it something you have seen before?
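
To scan a whole GFF3 file for features like the one above, a small parser is enough. This is a generic sketch, not CAT code; the function name and the returned dictionary layout are made up for illustration, and it assumes well-formed, tab-separated GFF3 lines with `key=value` attributes.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into its fixed columns and attribute dict."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "length": int(cols[4]) - int(cols[3]) + 1,  # GFF3 is 1-based, inclusive
        "attrs": attrs,
    }
```

For the gene line above this gives a length of 192,576,440 bp (193059942 - 483503 + 1), so filtering on `type == "gene"` and a length cutoff would flag it immediately.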

ifiddes commented on September 16, 2024

False fusions are often a problem with these kinds of ab-initio predictions, but that is crazy crazy long. Is it possible to share the assembly hub? If I remember correctly, you provided it with a pre-formed dataset derived from augustus-ppx, right? In that case, this should exist in the input set (CAT will do nothing with it past that point but classify it and decide whether to include or exclude it). Or are you actually running AugustusPB now? If that is the case, what hints are in the hints database? These false fusions occur most often when the model has only sequence information to go off of.

ifiddes commented on September 16, 2024

Also just saw that both CGP and PB predicted the same thing. Also very interesting.

lassancejm commented on September 16, 2024

I mapped the transcripts from my curated annotation using gmap and used that as IsoSeq hints to run PB (previously, CAT complained that I was not providing data). I guess I was too optimistic in thinking that AugustusPB would generate predictions corresponding perfectly to my curated annotation, as it sounds like there may not be enough information to generate reliable predictions with PB. I may roll back to the initial plan, replace the PB predictions with my own gtf and regenerate the consensus, if you think that this is what is causing the problem.

It is a bit confusing that the gene has transcript_modes=augPB,augCGP, because individual transcripts have either transcript_modes=augCGP or transcript_modes=augPB, but never the two together. Is there such a thing as a 'proximity' rule to define when transcripts belong to the same gene in the consensus (i.e. if two things are less than x bp from each other, then they belong to the same thing, a bit like what Cufflinks does, for example)?

I ran CAT without the --assembly-hub flag. I guess re-running CAT with that option would produce the assembly hub.

ifiddes commented on September 16, 2024

Yes, adding --assembly-hub will generate the hub. I am also a bit confused by the gene level tag, that sounds like it is a bug in my GFF3 writing. What does the gp_info say for that gene? I would love to take a look at the hub, as PB should do a decent job at taking your existing annotations. If not, I want to make it work.

lassancejm commented on September 16, 2024

Here is the gp_info associated with that gene.

Will restart CAT momentarily to generate the hub.

BEAST__G0000001.gp_info.txt

ifiddes commented on September 16, 2024

Ya, something is wrong here. The gene counter isn't being incremented properly. I will look into it.

ifiddes commented on September 16, 2024

What do the first few lines of the PB .gp file look like? I think something must be wrong with the name2 field, as I rely on that field to know what gene we are looking at in novel predictions.

lassancejm commented on September 16, 2024

here are a few lines:

augPB-67.t1	chr1	-	483502	517463	483502	517463	6	483502,490014,493354,496742,497905,517254,	484422,490138,493582,497546,498197,517463,	0	augPB-67	cmpl	cmpl	1,0,0,0,2,0,
augPB-68.t1	chr1	+	615108	631032	615108	631032	5	615108,615944,620283,623370,630970,	615240,616172,620407,623382,631032,	0	augPB-68	cmpl	cmpl	0,0,0,1,1,
augPB-69.t1	chr1	-	653159	689365	653159	689365	6	653159,675347,679440,682991,684071,689158,	653188,676154,679564,683219,684872,689365,	0	augPB-69	cmpl	cmpl	1,1,0,0,0,0,
augPB-70.t1	chr1	+	1276604	1430174	1276604	1430174	2	1276604,1430142,	1277922,1430174,	0	augPB-70	cmpl	cmpl	0,1,
augPB-71.t1	chr1	-	1540364	1546767	1540364	1546767	4	1540364,1543911,1545556,1546629,	1541293,1544035,1545775,1546767,	0	augPB-71	cmpl	cmpl	1,0,0,0,
augPB-72.t1	chr1	-	2029082	2029328	2029082	2029328	1	2029082,	2029328,	0	augPB-72	cmpl	cmpl	0,
augPB-73.t1	chr1	+	2133805	2298726	2133805	2298726	9	2133805,2148173,2148821,2150309,2151995,2155825,2210621,2264193,2298716,	2134056,2148465,2149628,2150525,2152119,2155899,2210664,2264293,2298726,	0	augPB-73	cmpl	cmpl	0,2,0,0,0,1,0,1,2,
augPB-73.t2	chr1	+	2149607	2317495	2149607	2317495	11	2149607,2150309,2151995,2155825,2210621,2298684,2302178,2302828,2304322,2313142,2316584,	2149628,2150525,2152119,2155899,2210664,2298760,2302470,2303635,2304541,2313266,2317495,	0	augPB-73	cmpl	cmpl	0,0,0,1,0,1,2,0,0,0,1,
augPB-73.t3	chr1	+	2149607	2322145	2149607	2322145	13	2149607,2150309,2151995,2155825,2210621,2264193,2276261,2302178,2302828,2304322,2313142,2316584,2322099,	2149628,2150525,2152119,2155899,2210664,2264293,2276516,2302470,2303635,2304541,2313266,2317491,2322145,	0	augPB-73	cmpl	cmpl	0,0,0,1,0,1,2,2,0,0,0,1,2,
augPB-74.t1	chr1	+	2392595	2459244	2392595	2459244	4	2392595,2392933,2394632,2459218,	2392616,2393152,2394756,2459244,	0	augPB-74	cmpl	cmpl	0,0,0,1,
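
In extended genePred (the `-genePredExt` layout), name2 is the 12th tab-separated column, after the score column. A quick way to check it across a whole file is to group transcript names by that column. This is only a sketch; the function name is hypothetical and it assumes tab-separated lines with at least 12 columns, like the ones above.

```python
from collections import defaultdict


def transcripts_by_gene(gp_lines):
    """Map name2 (column 12 of genePredExt) to the transcript names under it."""
    genes = defaultdict(list)
    for line in gp_lines:
        if not line.strip():
            continue  # skip blank lines
        cols = line.rstrip("\n").split("\t")
        genes[cols[11]].append(cols[0])
    return dict(genes)
```

On the lines above this yields one entry per `augPB-N` gene, with `augPB-73` holding the three transcripts `.t1`, `.t2` and `.t3`, which is what a correct name2 field should look like.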

ifiddes commented on September 16, 2024

That is weird, I don't understand why it is broken then. I hate to ask, but is there any way you could share your database files? I may end up needing the full input to consensus finding to track this down, but for now I think the $genome.db file will help.

lassancejm commented on September 16, 2024

Something crossed my mind: I noticed that you fixed a bug in tools/transcripts.py, which I had not noticed.
Do you think this may have anything to do with the erroneous consensus generation?

ifiddes commented on September 16, 2024

I don't think so.

I think the bug is here:

https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/blob/master/cat/consensus.py#L724-L728

I keep track of gene IDs to assign unique identifiers, handling the case where sorted order may not be gene order. source_gene should always be None for a CGP/PB transcript that was not assigned a parental gene from transMap, and so then I assign it to the name2 field. Somehow I think this is not incrementing properly, but it does for my test cases. For that reason, I was going to look at your database and see what your AugPbAlternativeGenes table contained.
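
The bookkeeping described above can be illustrated with a toy version. This is not the code in `consensus.py`: the function name, the `(source_gene, name2)` tuple layout and the default prefix are assumptions made for illustration, with the ID format copied from the `BEAST_G0000001` line earlier in the thread.

```python
def assign_gene_ids(transcripts, prefix="BEAST"):
    """Give every distinct gene key one stable ID, regardless of input order.

    transcripts: iterable of (source_gene, name2) tuples. source_gene is None
    for predictions with no parent gene, in which case name2 is the key.
    """
    ids = {}
    out = []
    for source_gene, name2 in transcripts:
        key = source_gene if source_gene is not None else name2
        if key not in ids:
            # The counter only advances when a new gene key is first seen.
            ids[key] = "{}_G{:07d}".format(prefix, len(ids) + 1)
        out.append(ids[key])
    return out
```

In this toy version, thousands of transcripts can only share one gene ID if they all resolve to the same key, for example if name2 were identical or missing for every record.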

lassancejm commented on September 16, 2024

Finally getting back to this after doing some testing.
First, I don't think there is a bug; more likely some inconsistencies were introduced when I was troubleshooting the issues resulting from the update of the toil module. I ended up deleting the database and the hgm folder, and restarted the pipeline. Now the consensus does not contain this very long gene anymore.
Second, and this is somewhat secondary, I tried to replace the automatically produced augPB.gtf with my own. However, after CAT finished, I could see that the original had been restored and used. So the hack did not work.

ifiddes commented on September 16, 2024

It should work. However, you will need to replace the genePred, not the GTF. The GTF is the direct output of AugustusPB, but CAT works in genePred space.

So you will want to use the Kent program gtfToGenePred to replace that file. I realized one other hack that would need to be done -- the pipeline relies on a consistent naming scheme, where each augustusPB transcript ID is of the form augPB-X.tY and gene ID is of the form augPB-X. I am defending next week, so after the Thanksgiving break I should have time to add the ability to directly incorporate external gene predictions in the process.
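
The naming scheme can be checked before dropping a replacement genePred into the work directory. A minimal sketch, assuming X and Y are plain integers (as in the `augPB-73.t2` style IDs shown earlier in the thread); the helper name and the regular expressions are illustrative, not taken from CAT.

```python
import re

TX_RE = re.compile(r"^augPB-(\d+)\.t(\d+)$")
GENE_RE = re.compile(r"^augPB-(\d+)$")


def names_consistent(tx_id, gene_id):
    """True if tx_id is augPB-X.tY, gene_id is augPB-X, and X agrees."""
    tx = TX_RE.match(tx_id)
    gene = GENE_RE.match(gene_id)
    return bool(tx and gene and tx.group(1) == gene.group(1))
```

Applied to the first column (transcript name) and the name2 column of every genePred record, this would flag entries that need renaming.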

lassancejm commented on September 16, 2024

I think I followed those steps, but will doublecheck.

> I am defending next week

I should let you focus on that then and refrain from bugging you until after Thanksgiving. Good luck with your defense!

ifiddes commented on September 16, 2024

Can I close this, or are there still outstanding issues?

lassancejm commented on September 16, 2024

I guess it can be closed; as it is, I never got this to work the way I wanted. It seems easier to merge the CAT output with an external gff3 afterwards.

fbemm commented on September 16, 2024

Does CAT actually output the protein evidence tracks separately, besides in the AssemblyHub?

ifiddes commented on September 16, 2024

If you didn't get it to work, I will fix it.

I am going to start a new issue and add a method to directly provide additional transcripts to CAT. That should be easy.
