Code Monkey home page Code Monkey logo

Comments (8)

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

SpliceAI only scores variants within genes and completely ignores intergenic variants (genes are defined in the annotations/grch37.txt file). My guess is that the variants you are providing are either intergenic or belong to genes that don't exist in the file. Could you share with me a few more details about the variants so that I can better advise you?

from spliceai.

 avatar commented on August 14, 2024

Thanks for the quick reply!

Actually, I thought that the variants in intergenic regions might be also of great interest since it's known that many SNVs associated with traits are harbored by these regions.

The sample.vcf in this case is the result of a variant calling on an exome BAM-file. So it should carry genic variants only but maybe the 'unknown' variants in 0000.vcf might be due to noise, e.g. unspecific exome capture? Alternatively, the genes harboring the variants don't exist in the annotation file?

Would you suggest modifying the annotation file? Actually, I was planning to repeat the whole procedure for a genomic vcf since I was especially interested in 'unknown' variants. However, this might not be useful if I understood you correctly.

from spliceai.

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

Yes, modifying the annotation file is the easiest way forward. More precisely, you can add lines to the annotation file with the genes/isoforms that you are interested in that are missing currently and you should get scores for variants in those genes/isoforms in the output. The format is fairly intuitive, you can check one or two lines in the existing annotation file just to make sure you understood it right.

SpliceAI uses context on the RNA and not the DNA, so I'm not sure if using it to evaluate intergenic variants is worthwhile. While intergenic variants are known to affect gene expression a lot, I am not sure if they have a serious role in affecting splicing.

from spliceai.

 avatar commented on August 14, 2024

Thanks for your help! I followed your advice. The output file is not empty anymore. Some of the lines contain the required delta scores while others still contain informations of the input file. More severly, spliceAI stops writing to the output-VCF after some minutes (namely: at the gene "SPCS2P4") with the following error message:

Traceback (most recent call last):
File "/home/nailufra/.local/bin/spliceai", line 10, in
sys.exit(main())
File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/main.py", line 72, in main
scores = get_delta_scores(record, ann, args.D, args.M)
File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 132, in get_delta_scores
dist_ann = ann.get_pos_data(idxs[i], record.pos)
File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 62, in get_pos_data
dist_exon_bdry = min(np.union1d(self.exon_starts[idx], self.exon_ends[idx])-pos, key=abs)
ValueError: min() arg is an empty sequence

What I did so far:

  1. Downloaded GTF: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.basic.annotation.gtf.gz
  2. extracting the start and end information of each gene entry in the GTF (filtered all lines where third columns is "gene")
  3. extracting the information on start and end position of "exon" entries (filtered all lines where third columns is "exon"), grouped by gene and put them comma-separated into a EXON_START and EXON_END column, respectively
  4. merged the gene and exon information into one data frame
  5. ´bedtools intersect´ to filter the data frame (see 4)) with the coordinates of the vcf2bed-transformed 0000.vcf file (see post above).
  6. rearrange columns to get the genePred format
  7. append file to grch37.txt (I also tried to only save the filtered file to grch37.txt and deleted the original entries - same error!)

To me the error message is not completely clear. It seems that the arrangement of the exon start and end positions might be wrong for this specific gene? Would be nice to hear your suggestions on this!

from spliceai.

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

It looks like you haven't added the extra lines to the annotation files in the right format. Could you copy/paste a few lines of the new genes that you have added to grch37.txt so that I can better advise you?

from spliceai.

 avatar commented on August 14, 2024

The first three lines/genes of the resulting genePred (see bullet point 6 in prev. post) appended to grch37.txt (they're tab-separated in the file but appear to be only space-separated in this post):

RP4-669L17.10 1 + 317720 453948 324756, 327552 326514, 328453
MXRA8 1 - 1288069 1297157 1288069, 1288071, 1288072, 1288072, 1289228, 1289228, 1289228, 1289228, 1289410, 1289410, 1289410, 1289410, 1289573, 1289573, 1289573, 1289573, 1289734, 1289734, 1289734, 1289734, 1290062, 1290062, 1290062, 1290062, 1290624, 1290624, 1290624, 1290624, 1290830, 1290830, 1290830, 1292061, 1292061, 1292061, 1292061, 1293836, 1293836, 1293836, 1296622 1288712, 1289009, 1289009, 1289009, 1289308, 1289308, 1289308, 1289326, 1289486, 1289486, 1289486, 1289486, 1289612, 1289612, 1289612, 1289612, 1289889, 1289889, 1289889, 1289889, 1290532, 1290532, 1290532, 1290532, 1290725, 1290725, 1290725, 1290725, 1291132, 1291132, 1291132, 1292084, 1292084, 1292084, 1292084, 1293915, 1293923, 1294174, 1297157
RP1-140A9.1 1 + 1822910 1824097 1822910, 1823573 1823290, 1824097

SpliceAI throws the error when processing a variant at chr: 1 and pos: 28422847. The variant is harbored by the gene "SPCS2P4". The appended line in grch37.txt corresponding to the variant causing the error is:

SPCS2P4 1 - 28422253 28422933 28422253 28422933

EDIT:
Assuming that the error might be caused by 'single-exon genes' (SEGs) such as SPCS2P4, I removed all lines corresponding to SEGs and the spliceAI run finished successfully!

There are still two questions remaining with this issue:

  1. What might be the problem with these entries?
  2. Assuming that the GENCODE V24 annotation was used as described on the main page, why are some of the entries not present in the provided grch37.txt but in gencode.v24lift37.basic.annotation.gtf.gz?

from spliceai.

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

Here are the answers to your questions:

  1. SpliceAI also works for single exon genes. We followed the UCSC table browser annotation, which also adds a comma after the last exon. For example:

OR4F5 1 + 69090 70008 69090, 70008,

The parser inside SpliceAI implicitly uses this, and is the reason behind the error that you are facing. Please add a , at the end of each EXON_START and EXON_END entry (this statement applies for multiple exon genes as well). For example,

SPCS2P4 1 - 28422253 28422933 28422253 28422933

should be changed to

SPCS2P4 1 - 28422253 28422933 28422253, 28422933,

and everything should work.

  1. We further filtered the gene list to include only canonical transcripts (when multiple transcripts are available for the same gene, we picked the one with the largest coding sequence).

from spliceai.

 avatar commented on August 14, 2024

Now everything works fine - thank you very much for your generous help!

from spliceai.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.