Comments (7)
Could you open Python, run the following, and let me know the output?
import pyfasta
fasta = pyfasta.Fasta('examples/human_g1k_v37.fasta')
print fasta.keys()
from spliceai.
Yes, there was no error when I printed the fasta keys. Please see the output below:
import pyfasta
fasta = pyfasta.Fasta('examples/human_g1k_v37.fasta')
print fasta.keys()
['1 dna:chromosome chromosome:GRCh37:1:1:249250621:1', 'GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1', 'GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1', 'GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1', '16 dna:chromosome chromosome:GRCh37:16:1:90354753:1', 'GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1', '2 dna:chromosome chromosome:GRCh37:2:1:243199373:1', '13 dna:chromosome chromosome:GRCh37:13:1:115169878:1', 'GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1', 'GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1', '4 dna:chromosome chromosome:GRCh37:4:1:191154276:1', '18 dna:chromosome chromosome:GRCh37:18:1:78077248:1', 'GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1', 'GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1', 'GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1', 'GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1', 'GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1', 'GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1', '3 dna:chromosome chromosome:GRCh37:3:1:198022430:1', '8 dna:chromosome chromosome:GRCh37:8:1:146364022:1', '7 dna:chromosome chromosome:GRCh37:7:1:159138663:1', '22 dna:chromosome chromosome:GRCh37:22:1:51304566:1', 'GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1', 'GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1', 'GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1', 'GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1', 'GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1', 'GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1', 'MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome', 'GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1', 'GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1', 'GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1', 'GL000245.1 
dna:supercontig supercontig::GL000245.1:1:36651:1', 'GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1', 'GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1', '12 dna:chromosome chromosome:GRCh37:12:1:133851895:1', 'GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1', 'GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1', '5 dna:chromosome chromosome:GRCh37:5:1:180915260:1', '21 dna:chromosome chromosome:GRCh37:21:1:48129895:1', 'GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1', 'GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1', 'GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1', 'GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1', 'GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1', 'GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1', 'GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1', '6 dna:chromosome chromosome:GRCh37:6:1:171115067:1', 'GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1', 'GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1', 'GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1', '10 dna:chromosome chromosome:GRCh37:10:1:135534747:1', '20 dna:chromosome chromosome:GRCh37:20:1:63025520:1', 'GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1', 'GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1', 'Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1', 'GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1', 'GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1', '11 dna:chromosome chromosome:GRCh37:11:1:135006516:1', 'GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1', 'GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1', '15 dna:chromosome chromosome:GRCh37:15:1:102531392:1', 'GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1', 'GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1', 'GL000227.1 
dna:supercontig supercontig::GL000227.1:1:128374:1', 'GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1', '17 dna:chromosome chromosome:GRCh37:17:1:81195210:1', 'GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1', 'GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1', 'GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1', 'GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1', 'GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1', 'GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1', '19 dna:chromosome chromosome:GRCh37:19:1:59128983:1', 'GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1', '9 dna:chromosome chromosome:GRCh37:9:1:141213431:1', 'GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1', 'X dna:chromosome chromosome:GRCh37:X:1:155270560:1', 'GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1', 'GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1', '14 dna:chromosome chromosome:GRCh37:14:1:107349540:1', 'GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1', 'GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1', 'GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1']
from spliceai.
When the software is run on examples/input.vcf, it assumes that the keys of the fasta file are either ['chrX', 'chrY', 'chr1', 'chr2', and so on] or ['X', 'Y', '1', '2', and so on]. Could you try using a more standard fasta file (like the one available in the UCSC genome browser)?
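To illustrate the mismatch, here is a minimal sketch that checks whether fasta keys are bare contig names and shows what they would look like after trimming. The helper names `contig_name` and `has_plain_contig_keys` are hypothetical and not part of spliceai or pyfasta; the sample keys are copied from the output above.

```python
def contig_name(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    return header.split()[0]


def has_plain_contig_keys(keys):
    """True if every key is a bare contig name with no trailing description."""
    return all(key == contig_name(key) for key in keys)


# A few keys copied from the printed output of fasta.keys() above.
keys = [
    "1 dna:chromosome chromosome:GRCh37:1:1:249250621:1",
    "X dna:chromosome chromosome:GRCh37:X:1:155270560:1",
    "MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome",
]

print(has_plain_contig_keys(keys))        # False
print([contig_name(k) for k in keys])     # ['1', 'X', 'MT']
```

If I remember the pyfasta README correctly, `pyfasta.Fasta` accepts a `key_fn` argument that could apply this kind of trimming when the file is loaded, but please check that against your installed version.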
from spliceai.
I'm also having issues with 'chr' prefixes and getting spliceai to run.
I installed tensorflow and spliceai on Mac OS X using pip, then downloaded hg38.fa.gz from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
I then tried running on whole_genome_filtered_spliceai_scores.vcf.gz as a simple test (even though it's an hg19 vcf):
$ python -m spliceai -R hg38.fa.gz -I whole_genome_filtered_spliceai_scores.vcf.gz -O temp.vcf
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/site-packages/spliceai/__main__.py", line 60, in <module>
main()
File "/usr/local/lib/python2.7/site-packages/spliceai/__main__.py", line 50, in main
ann = annotator(args.R, args.A)
File "/usr/local/lib/python2.7/site-packages/spliceai/utils.py", line 23, in __init__
self.ref_fasta = pyfasta.Fasta(ref_fasta)
File "/usr/local/lib/python2.7/site-packages/pyfasta/fasta.py", line 73, in __init__
flatten_inplace)
File "/usr/local/lib/python2.7/site-packages/pyfasta/records.py", line 57, in prepare
for i, (seqid, seq) in enumerate(seqinfo_generator):
File "/usr/local/lib/python2.7/site-packages/pyfasta/fasta.py", line 110, in gen_seqs_with_headers
seqs.append(line)
AttributeError: 'NoneType' object has no attribute 'append'
I realized this error is because spliceai/pyfasta can't handle gzipped fasta files, so I unzipped hg38.fa.gz and ran
$ python -m spliceai -R hg38.fa -I whole_genome_filtered_spliceai_scores.vcf.gz -O temp.vcf
Using TensorFlow backend.
2019-01-27 20:17:20.578701: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
Segmentation fault: 11
I then realized that all spliceai examples use an uncompressed vcf, so I uncompressed whole_genome_filtered_spliceai_scores.vcf.gz and reran the command, but still got a segfault:
$ python -m spliceai -R hg38.fa -I whole_genome_filtered_spliceai_scores.vcf -O temp.vcf
Using TensorFlow backend.
2019-01-27 20:22:48.173551: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
Segmentation fault: 11
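For anyone who wants to do the decompression steps from Python rather than the shell, here is a minimal sketch using only the standard library. The function name `gunzip` is hypothetical, and the file names in the usage comment are the ones from the commands above.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Stream-decompress the gzip file src into dst, one chunk at a time."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Example usage (paths as in the commands above):
# gunzip("hg38.fa.gz", "hg38.fa")
# gunzip("whole_genome_filtered_spliceai_scores.vcf.gz",
#        "whole_genome_filtered_spliceai_scores.vcf")
```

Copying in chunks keeps memory use small, which matters for a whole-genome fasta.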
from spliceai.
There are a couple of things that you are missing right now:
- First, the VCF that you are using as input (whole_genome_filtered_spliceai_scores.vcf.gz) is missing some header lines that pysam requires. If you add these lines to the header of the VCF, you will no longer get the segmentation fault (you can also copy them from the example input).
##assembly=GRCh37/hg19
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
- Second, the VCF that you are using (as well as the default annotation file) corresponds to hg19/GRCh37, so please download and use the hg19 fasta file instead. I believe UCSC provides it as a separate file for each chromosome, so you might have to concatenate them.
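Adding the header lines by hand works, but it can also be scripted. Below is a minimal sketch, assuming the VCF is uncompressed and has been read into a list of lines; the function name `add_contig_lines` and the sample record are hypothetical, and only the contig length is taken from the list above.

```python
def add_contig_lines(vcf_lines, contigs):
    """Return VCF lines with ##contig entries inserted just before #CHROM.

    contigs is a list of (name, length) pairs. Contigs that the header
    already declares are skipped, so the function is safe to run twice.
    """
    declared = {
        line.split("ID=")[1].split(",")[0].rstrip(">\n")
        for line in vcf_lines
        if line.startswith("##contig=<") and "ID=" in line
    }
    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            for name, length in contigs:
                if name not in declared:
                    out.append("##contig=<ID=%s,length=%d>\n" % (name, length))
        out.append(line)
    return out


# Tiny made-up VCF with no contig lines in its header.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "10\t94077\t.\tA\tC\t.\t.\t.\n",
]
fixed = add_contig_lines(vcf, [("10", 135534747)])
print("".join(fixed))
```

For a real file, you would read the lines with `open(path).readlines()`, pass the full contig list, and write the result to a new file rather than overwriting the original.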
from spliceai.
When the software is run on examples/input.vcf, it assumes that the keys of the fasta file are either ['chrX', 'chrY', 'chr1', 'chr2', and so on] or ['X', 'Y', '1', '2', and so on]. Could you try using a more standard fasta file (like the one available in the UCSC genome browser)?
It might be helpful for a lot of people to use pyfaidx (https://github.com/mdshw5/pyfaidx) rather than pyfasta, since many people will be using references like those recommended by Heng Li (http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use) or those provided in the GATK resource bundle. The FASTA headers in these references all contain fields after the contig name, which cause pyfasta to fail, while pyfaidx handles them as expected.
from spliceai.
Thanks for the suggestion. The current release (v1.2) uses pyfaidx and also handles the '1' vs. 'chr1' chromosome naming mismatch automatically.
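For readers curious what reconciling '1' and 'chr1' can look like, here is a minimal sketch. It is not the code used in spliceai v1.2; the function name `match_contig_name` is hypothetical, and it only adds or strips the 'chr' prefix (it does not handle cases like 'MT' vs. 'chrM').

```python
def match_contig_name(chrom, fasta_keys):
    """Return the spelling of chrom that appears in fasta_keys, or None."""
    if chrom in fasta_keys:
        return chrom
    # Try the other convention: drop a leading 'chr', or add one.
    alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return alt if alt in fasta_keys else None


print(match_contig_name("1", {"chr1", "chrX"}))    # chr1
print(match_contig_name("chrX", {"1", "X"}))       # X
print(match_contig_name("chr7", {"1", "X"}))       # None
```

Returning None for an unknown contig lets the caller decide whether to skip the variant or raise an error.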
from spliceai.