Comments (7)

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

Could you open Python, run the following, and let me know the output?

import pyfasta
fasta = pyfasta.Fasta('examples/human_g1k_v37.fasta')
print fasta.keys()

from spliceai.

xuguorong2016 avatar xuguorong2016 commented on August 14, 2024

Yes, there was no error when I printed the FASTA keys. Please see the output below:

import pyfasta
fasta = pyfasta.Fasta('examples/human_g1k_v37.fasta')
print fasta.keys()
['1 dna:chromosome chromosome:GRCh37:1:1:249250621:1', 'GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1', 'GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1', 'GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1', '16 dna:chromosome chromosome:GRCh37:16:1:90354753:1', 'GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1', '2 dna:chromosome chromosome:GRCh37:2:1:243199373:1', '13 dna:chromosome chromosome:GRCh37:13:1:115169878:1', 'GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1', 'GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1', '4 dna:chromosome chromosome:GRCh37:4:1:191154276:1', '18 dna:chromosome chromosome:GRCh37:18:1:78077248:1', 'GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1', 'GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1', 'GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1', 'GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1', 'GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1', 'GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1', '3 dna:chromosome chromosome:GRCh37:3:1:198022430:1', '8 dna:chromosome chromosome:GRCh37:8:1:146364022:1', '7 dna:chromosome chromosome:GRCh37:7:1:159138663:1', '22 dna:chromosome chromosome:GRCh37:22:1:51304566:1', 'GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1', 'GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1', 'GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1', 'GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1', 'GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1', 'GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1', 'MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome', 'GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1', 'GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1', 'GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1', 'GL000245.1 
dna:supercontig supercontig::GL000245.1:1:36651:1', 'GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1', 'GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1', '12 dna:chromosome chromosome:GRCh37:12:1:133851895:1', 'GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1', 'GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1', '5 dna:chromosome chromosome:GRCh37:5:1:180915260:1', '21 dna:chromosome chromosome:GRCh37:21:1:48129895:1', 'GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1', 'GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1', 'GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1', 'GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1', 'GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1', 'GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1', 'GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1', '6 dna:chromosome chromosome:GRCh37:6:1:171115067:1', 'GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1', 'GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1', 'GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1', '10 dna:chromosome chromosome:GRCh37:10:1:135534747:1', '20 dna:chromosome chromosome:GRCh37:20:1:63025520:1', 'GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1', 'GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1', 'Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1', 'GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1', 'GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1', '11 dna:chromosome chromosome:GRCh37:11:1:135006516:1', 'GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1', 'GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1', '15 dna:chromosome chromosome:GRCh37:15:1:102531392:1', 'GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1', 'GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1', 'GL000227.1 
dna:supercontig supercontig::GL000227.1:1:128374:1', 'GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1', '17 dna:chromosome chromosome:GRCh37:17:1:81195210:1', 'GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1', 'GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1', 'GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1', 'GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1', 'GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1', 'GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1', '19 dna:chromosome chromosome:GRCh37:19:1:59128983:1', 'GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1', '9 dna:chromosome chromosome:GRCh37:9:1:141213431:1', 'GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1', 'X dna:chromosome chromosome:GRCh37:X:1:155270560:1', 'GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1', 'GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1', '14 dna:chromosome chromosome:GRCh37:14:1:107349540:1', 'GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1', 'GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1', 'GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1']

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

When the software is run on examples/input.vcf, it assumes that the keys of the fasta file are either ['chrX', 'chrY', 'chr1', 'chr2', and so on] or ['X', 'Y', '1', '2', and so on]. Could you try using a more standard fasta file (like the one available in the UCSC genome browser)?
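
To check whether a given FASTA file meets this expectation, one could compare its keys against the two naming schemes. The following is only a minimal sketch: `MAIN_CHROMS`, `short_name` and `uses_expected_names` are hypothetical names chosen for illustration and are not part of spliceai or pyfasta.

```python
# Hypothetical helpers, not part of spliceai or pyfasta.
MAIN_CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def short_name(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    return header.split()[0]


def uses_expected_names(keys):
    """True if all main chromosomes appear as '1', 'X', ... or as 'chr1', 'chrX', ..."""
    keyset = set(keys)
    bare = set(MAIN_CHROMS)
    prefixed = {"chr" + c for c in MAIN_CHROMS}
    return bare <= keyset or prefixed <= keyset
```

Keys such as '1 dna:chromosome chromosome:GRCh37:1:1:249250621:1' from the output above would fail this check as they stand, but pass once each key is reduced with `short_name`.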

bw2 avatar bw2 commented on August 14, 2024

I'm also having issues with 'chr' prefixes and getting spliceai to run.
I installed tensorflow and spliceai on MacOSX using pip, then downloaded hg38.fa.gz from
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

I then tried running on whole_genome_filtered_spliceai_scores.vcf.gz as a simple test (even though it's an hg19 vcf):

$ python -m spliceai -R hg38.fa.gz -I whole_genome_filtered_spliceai_scores.vcf.gz -O temp.vcf
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/spliceai/__main__.py", line 60, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/spliceai/__main__.py", line 50, in main
    ann = annotator(args.R, args.A)
  File "/usr/local/lib/python2.7/site-packages/spliceai/utils.py", line 23, in __init__
    self.ref_fasta = pyfasta.Fasta(ref_fasta)
  File "/usr/local/lib/python2.7/site-packages/pyfasta/fasta.py", line 73, in __init__
    flatten_inplace)
  File "/usr/local/lib/python2.7/site-packages/pyfasta/records.py", line 57, in prepare
    for i, (seqid, seq) in enumerate(seqinfo_generator):
  File "/usr/local/lib/python2.7/site-packages/pyfasta/fasta.py", line 110, in gen_seqs_with_headers
    seqs.append(line)
AttributeError: 'NoneType' object has no attribute 'append'

I realized this error is because spliceai/pyfasta can't handle gzipped fasta files, so I unzipped hg38.fa.gz and ran

$ python -m spliceai -R hg38.fa -I whole_genome_filtered_spliceai_scores.vcf.gz -O temp.vcf
Using TensorFlow backend.
2019-01-27 20:17:20.578701: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
Segmentation fault: 11

I then realized that all spliceai examples use an uncompressed VCF, so I uncompressed whole_genome_filtered_spliceai_scores.vcf.gz and reran the command, but still got a segfault:

$ python -m spliceai -R hg38.fa -I whole_genome_filtered_spliceai_scores.vcf -O temp.vcf
Using TensorFlow backend.
2019-01-27 20:22:48.173551: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python2.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
Segmentation fault: 11
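
The decompression steps described above (for hg38.fa.gz and for the .vcf.gz) can also be done from Python with the standard library. This is a minimal sketch; the function name `gunzip` is a hypothetical helper, and the command-line gunzip tool does the same job.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`, streaming in chunks."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

For example, `gunzip("hg38.fa.gz", "hg38.fa")` would write the uncompressed FASTA next to the original. Streaming with `shutil.copyfileobj` avoids loading a multi-gigabyte genome into memory at once.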

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

There are a couple of things that you are missing right now:

  • The VCF that you are using as input (whole_genome_filtered_spliceai_scores.vcf.gz) is missing some header lines that pysam requires. If you add the lines below to the header of the VCF, you should no longer get the segmentation fault (you can also copy these lines from the example input).

##assembly=GRCh37/hg19
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>

  • Second, the VCF that you are using (and the default annotation file) corresponds to hg19/GRCh37, so please download and use the hg19 FASTA file instead. I believe UCSC provides it as a separate file for each chromosome, so you might have to concatenate them.
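
The first point, adding the missing header lines, could be scripted along the following lines. This is only a sketch under stated assumptions: `contig_line` and `add_header_lines` are hypothetical helpers, the VCF is assumed to be uncompressed and already split into lines without trailing newlines, and the new lines are placed directly before the `#CHROM` column line.

```python
def contig_line(name, length):
    """Format one VCF contig header line, e.g. ##contig=<ID=1,length=249250621>."""
    return "##contig=<ID={},length={}>".format(name, length)


def add_header_lines(vcf_lines, extra_lines):
    """Return a copy of vcf_lines with extra_lines inserted before '#CHROM'."""
    out = []
    inserted = False
    for line in vcf_lines:
        if not inserted and line.startswith("#CHROM"):
            out.extend(extra_lines)  # meta lines must precede the column header
            inserted = True
        out.append(line)
    if not inserted:
        raise ValueError("no #CHROM line found; is this a VCF?")
    return out
```

A call such as `add_header_lines(lines, [contig_line("1", 249250621)])` leaves the variant records untouched and changes only the header.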

david-a-parry avatar david-a-parry commented on August 14, 2024

> When the software is run on examples/input.vcf, it assumes that the keys of the fasta file are either ['chrX', 'chrY', 'chr1', 'chr2', and so on] or ['X', 'Y', '1', '2', and so on]. Could you try using a more standard fasta file (like the one available in the UCSC genome browser)?

It might be helpful to use pyfaidx (https://github.com/mdshw5/pyfaidx) rather than pyfasta, since many people will be using references like those recommended by Heng Li (http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use) or provided by the GATK resource bundle. These references all contain fields after the contig name in the FASTA headers, which cause pyfasta to fail, while pyfaidx handles them as expected.

kishorejaganathan avatar kishorejaganathan commented on August 14, 2024

Thanks for the suggestion. The current release (v1.2) uses pyfaidx and also handles the '1' vs. 'chr1' chromosome naming mismatch automatically.
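
For readers curious what reconciling '1' with 'chr1' can look like, here is one possible approach. It is an illustrative sketch only, not the code used in spliceai v1.2, and `match_chrom` is a hypothetical name.

```python
def match_chrom(chrom, fasta_keys):
    """Return the FASTA key corresponding to chrom, or None if there is none.

    Tries the name as given, then with the 'chr' prefix added or removed.
    """
    if chrom in fasta_keys:
        return chrom
    alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return alt if alt in fasta_keys else None
```

Note that this simple rule does not cover naming differences that go beyond the prefix, such as 'MT' versus 'chrM'.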
