Comments (8)
SpliceAI only scores variants within genes and completely ignores intergenic variants (genes are defined in the annotations/grch37.txt file). My guess is that the variants you are providing are either intergenic or belong to genes that don't exist in the file. Could you share with me a few more details about the variants so that I can better advise you?
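To check quickly whether a given variant falls inside any gene in the annotation file, something like the following can help. This is only a sketch: the helper name `genes_containing` is made up for illustration, the file is assumed to be tab-separated with name, chromosome, strand, transcript start and transcript end as the first five columns, and treating both transcript ends as inclusive is an assumption rather than SpliceAI's documented behaviour.

```python
def genes_containing(annotation_lines, chrom, pos):
    """Return names of annotated genes whose transcript span contains pos.

    Hypothetical helper, not part of SpliceAI. Assumes tab-separated lines
    whose first five columns are name, chromosome, strand, start, end.
    """
    hits = []
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any '#' header line
        name, gene_chrom, _strand, tx_start, tx_end = line.split("\t")[:5]
        # Inclusive bounds on both sides are an assumption of this sketch.
        if gene_chrom == chrom and int(tx_start) <= pos <= int(tx_end):
            hits.append(name)
    return hits


# One annotation line in the style used by the bundled file.
annotation = ["OR4F5\t1\t+\t69090\t70008\t69090,\t70008,"]
print(genes_containing(annotation, "1", 69500))   # inside OR4F5
print(genes_containing(annotation, "1", 500000))  # not inside any listed gene
```

A variant for which this returns an empty list would, by the explanation above, not be scored.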
from spliceai.
Thanks for the quick reply!
Actually, I thought that variants in intergenic regions might also be of great interest, since many trait-associated SNVs are known to lie in these regions.
The sample.vcf in this case is the result of variant calling on an exome BAM file, so it should carry genic variants only. Maybe the 'unknown' variants in 0000.vcf are due to noise, e.g. unspecific exome capture? Alternatively, do the genes harboring the variants not exist in the annotation file?
Would you suggest modifying the annotation file? Actually, I was planning to repeat the whole procedure for a genomic vcf since I was especially interested in 'unknown' variants. However, this might not be useful if I understood you correctly.
from spliceai.
Yes, modifying the annotation file is the easiest way forward. More precisely, you can add lines to the annotation file for the genes/isoforms you are interested in that are currently missing, and you should then get scores for variants in those genes/isoforms in the output. The format is fairly intuitive; check one or two lines in the existing annotation file to make sure you have understood it correctly.
SpliceAI uses context on the RNA and not the DNA, so I'm not sure if using it to evaluate intergenic variants is worthwhile. While intergenic variants are known to affect gene expression a lot, I am not sure if they have a serious role in affecting splicing.
from spliceai.
Thanks for your help! I followed your advice. The output file is not empty anymore. Some of the lines contain the required delta scores, while others still contain only the information from the input file. More severely, SpliceAI stops writing to the output VCF after a few minutes (namely at the gene "SPCS2P4") with the following error message:
Traceback (most recent call last):
  File "/home/nailufra/.local/bin/spliceai", line 10, in <module>
    sys.exit(main())
  File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/main.py", line 72, in main
    scores = get_delta_scores(record, ann, args.D, args.M)
  File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 132, in get_delta_scores
    dist_ann = ann.get_pos_data(idxs[i], record.pos)
  File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 62, in get_pos_data
    dist_exon_bdry = min(np.union1d(self.exon_starts[idx], self.exon_ends[idx])-pos, key=abs)
ValueError: min() arg is an empty sequence
What I did so far:
1. Downloaded the GTF: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.basic.annotation.gtf.gz
2. Extracted the start and end of each gene entry in the GTF (filtered all lines where the third column is "gene").
3. Extracted the start and end positions of the "exon" entries (filtered all lines where the third column is "exon"), grouped them by gene and put them comma-separated into an EXON_START and an EXON_END column, respectively.
4. Merged the gene and exon information into one data frame.
5. Used `bedtools intersect` to filter the data frame (see 4) against the coordinates of the vcf2bed-transformed 0000.vcf file (see post above).
6. Rearranged the columns to get the genePred format.
7. Appended the file to grch37.txt (I also tried saving only the filtered file to grch37.txt and deleting the original entries: same error!).
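For readers who want to reproduce the GTF-to-genePred part of the procedure above (collecting gene and exon rows, grouping exons per gene, rearranging columns), here is a rough standard-library sketch. It is not the poster's actual script. The function name `gtf_to_annotation_rows` is hypothetical, stripping the "chr" prefix and de-duplicating exons are choices made for this sketch, and the coordinates are copied from the GTF unchanged, so check against existing lines in grch37.txt whether an offset is needed. The exon columns are written with a comma after every value, including the last, following the UCSC-style convention that comes up later in this thread.

```python
import re
from collections import defaultdict

GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_to_annotation_rows(gtf_lines):
    """Build NAME, CHROM, STRAND, TX_START, TX_END, EXON_START, EXON_END rows.

    Hypothetical helper: exons of all transcripts are pooled per gene and
    coordinates are taken from the GTF as-is (no 0-based/1-based shift).
    """
    genes = {}                 # gene name -> (chrom, strand, start, end)
    exons = defaultdict(set)   # gene name -> {(exon_start, exon_end), ...}

    for line in gtf_lines:
        if line.startswith("#"):
            continue  # GTF comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, _, feature, start, end, _, strand, _, attrs = fields[:9]
        match = GENE_NAME_RE.search(attrs)
        if match is None:
            continue
        name = match.group(1)
        # Assumption: the annotation file uses "1" rather than "chr1".
        chrom = chrom[3:] if chrom.startswith("chr") else chrom
        if feature == "gene":
            genes[name] = (chrom, strand, int(start), int(end))
        elif feature == "exon":
            exons[name].add((int(start), int(end)))

    rows = []
    for name, (chrom, strand, tx_start, tx_end) in genes.items():
        if name not in exons:
            continue  # gene without any exon record
        ordered = sorted(exons[name])
        # Every value is followed by a comma, including the last one.
        starts = "".join(f"{s}," for s, _ in ordered)
        ends = "".join(f"{e}," for _, e in ordered)
        rows.append("\t".join(
            [name, chrom, strand, str(tx_start), str(tx_end), starts, ends]))
    return rows


# Toy input using the SPCS2P4 coordinates quoted in this thread;
# the source column and gene_id value are placeholders.
example = [
    'chr1\tsrc\tgene\t28422253\t28422933\t.\t-\t.\tgene_id "X"; gene_name "SPCS2P4";',
    'chr1\tsrc\texon\t28422253\t28422933\t.\t-\t.\tgene_id "X"; gene_name "SPCS2P4";',
]
for row in gtf_to_annotation_rows(example):
    print(row)
```

Pooling exons from all transcripts of a gene, as the steps above do, produces repeated boundaries like those in the MXRA8 line; the set used here merely collapses exact duplicates.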
To me the error message is not completely clear. It seems that the arrangement of the exon start and end positions might be wrong for this specific gene? Would be nice to hear your suggestions on this!
from spliceai.
It looks like you haven't added the extra lines to the annotation files in the right format. Could you copy/paste a few lines of the new genes that you have added to grch37.txt so that I can better advise you?
from spliceai.
The first three lines/genes of the resulting genePred (see bullet point 6 in prev. post) appended to grch37.txt (they're tab-separated in the file but appear to be only space-separated in this post):
RP4-669L17.10 1 + 317720 453948 324756, 327552 326514, 328453
MXRA8 1 - 1288069 1297157 1288069, 1288071, 1288072, 1288072, 1289228, 1289228, 1289228, 1289228, 1289410, 1289410, 1289410, 1289410, 1289573, 1289573, 1289573, 1289573, 1289734, 1289734, 1289734, 1289734, 1290062, 1290062, 1290062, 1290062, 1290624, 1290624, 1290624, 1290624, 1290830, 1290830, 1290830, 1292061, 1292061, 1292061, 1292061, 1293836, 1293836, 1293836, 1296622 1288712, 1289009, 1289009, 1289009, 1289308, 1289308, 1289308, 1289326, 1289486, 1289486, 1289486, 1289486, 1289612, 1289612, 1289612, 1289612, 1289889, 1289889, 1289889, 1289889, 1290532, 1290532, 1290532, 1290532, 1290725, 1290725, 1290725, 1290725, 1291132, 1291132, 1291132, 1292084, 1292084, 1292084, 1292084, 1293915, 1293923, 1294174, 1297157
RP1-140A9.1 1 + 1822910 1824097 1822910, 1823573 1823290, 1824097
SpliceAI throws the error when processing a variant at chr: 1 and pos: 28422847. The variant is harbored by the gene "SPCS2P4". The appended line in grch37.txt corresponding to the variant causing the error is:
SPCS2P4 1 - 28422253 28422933 28422253 28422933
EDIT:
Assuming that the error might be caused by 'single-exon genes' (SEGs) such as SPCS2P4, I removed all lines corresponding to SEGs and the spliceAI run finished successfully!
There are still two questions remaining with this issue:
- What might be the problem with these entries?
- Assuming that the GENCODE V24 annotation was used as described on the main page, why are some of the entries not present in the provided grch37.txt but in gencode.v24lift37.basic.annotation.gtf.gz?
from spliceai.
Here are the answers to your questions:
- SpliceAI also works for single exon genes. We followed the UCSC table browser annotation, which also adds a comma after the last exon. For example:
OR4F5 1 + 69090 70008 69090, 70008,
The parser inside SpliceAI implicitly relies on this trailing comma, which is the reason for the error you are facing. Please add a comma at the end of each EXON_START and EXON_END entry (this applies to multi-exon genes as well). For example,
SPCS2P4 1 - 28422253 28422933 28422253 28422933
should be changed to
SPCS2P4 1 - 28422253 28422933 28422253, 28422933,
and everything should work.
- We further filtered the gene list to include only canonical transcripts (when multiple transcripts are available for the same gene, we picked the one with the largest coding sequence).
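To illustrate the first answer: the sketch below mimics a parser that splits each exon field on commas and discards the last piece, which is one plausible reading of "implicitly uses this" and is an assumption, not a copy of SpliceAI's code. The function names are hypothetical. Under that assumption a single-exon entry without the trailing comma parses to an empty list (consistent with the `min() arg is an empty sequence` error), while a multi-exon entry would quietly lose its last exon instead of failing.

```python
def parse_exon_field(field):
    """Assumed parsing rule: split on commas and drop the final piece."""
    return [int(piece) for piece in field.split(",")[:-1]]


def ensure_trailing_comma(field):
    """Append a comma unless the field already ends with one."""
    field = field.strip()
    return field if field.endswith(",") else field + ","


def fix_annotation_line(line):
    """Add trailing commas to the last two tab-separated columns.

    Assumes EXON_START and EXON_END are the final two columns.
    """
    if line.startswith("#"):
        return line.rstrip("\n")  # leave a header line untouched
    columns = line.rstrip("\n").split("\t")
    columns[-2] = ensure_trailing_comma(columns[-2])
    columns[-1] = ensure_trailing_comma(columns[-1])
    return "\t".join(columns)


print(parse_exon_field("28422253"))          # no trailing comma: nothing is left
print(parse_exon_field("28422253,"))         # with trailing comma: one exon start
print(parse_exon_field("1822910, 1823573"))  # multi-exon, no comma: last value dropped

broken = "SPCS2P4\t1\t-\t28422253\t28422933\t28422253\t28422933"
print(fix_annotation_line(broken))
```

Running `fix_annotation_line` over only the appended lines (or over the whole file, since lines that already end in commas are left unchanged) should bring custom entries in line with the format described above.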
from spliceai.
Now everything works fine - thank you very much for your generous help!
from spliceai.