
hgtsim's Introduction


Publication

Workflow

[figure: workflow]

Dependencies

Change Log

  • 2019-01-06:

    • HgtSIM can be installed with "pip3 install HgtSIM" now.
  • 2018-04-06:

    • Combined the '-mixed', '-mini' and '-maxi' options into one: '-mixed min-max'.
  • 2017-09-16:

    • Added support for draft genomes.
    • Added support for dynamic flanking sequences.
    • Added support for the 'mixed' mode.
    • Added support for the 'keep_cds' option.

To-do

  • run Prodigal if "-keep_cds" was specified
  • check Ns in provided gene sequences
  • check whether provided sequences to transfer are ORFs, exit if not
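
The last two checks are not yet part of HgtSIM, but a user can run something similar on the input genes beforehand. The following is a minimal sketch, not HgtSIM code: the function name `check_transfer` is hypothetical, and it assumes that "ORF" means a sequence whose length is a multiple of three with no in-frame stop codon (TAA, TAG or TGA) before the final codon.

```python
# Hypothetical pre-flight check; not part of HgtSIM.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_transfer(seq):
    """Return a list of problems found in one nucleotide sequence (empty if none)."""
    seq = seq.upper()
    problems = []
    if "N" in seq:
        problems.append("contains N")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A stop codon anywhere except the last position interrupts the reading frame.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


if __name__ == "__main__":
    for name, seq in [("ok_gene", "ATGAAATAA"), ("bad_gene", "ATGTAGAAATAA")]:
        print(name, check_transfer(seq) or "looks like an ORF")
```

Sequences that return a non-empty list could be fixed or dropped before they are passed to '-t'.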

Installation

  • HgtSIM is implemented in Python 3. You can install it with:

      pip3 install HgtSIM
    
  • HgtSIM requires BLAST+. You can either add it to your system path or specify the full paths to the "blastn" and "blastp" executables with the options "-blastn" and "-blastp".
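
To see whether the two BLAST+ executables are already on the system path, a short check like the one below can help. This is only a sketch using Python's standard `shutil.which`; the helper name `find_blast_executables` is made up for illustration and is not part of HgtSIM.

```python
import shutil


def find_blast_executables(names=("blastn", "blastp")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_blast_executables().items():
        if path is None:
            # Fall back to the HgtSIM option that takes an explicit path.
            print(f"{name}: not found on PATH; pass its full path with -{name}")
        else:
            print(f"{name}: {path}")
```

If either executable is reported as missing, its location can be given to HgtSIM with '-blastn' or '-blastp'.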

Help information

    HgtSIM -h

      -t          sequences of genes to be transferred (multi-fasta format)
      -i          mutation level
      -d          distribution of transfers to the recipient genomes
      -f          folder holds recipient genomes
      -r          ratio of mutation types
      -x          file extension of recipient genomes
      -lf         left end flanking sequences
      -rf         right end flanking sequences
      -mixed      randomly assign mutation levels between specified values, parameter format: min-max
      -keep_cds   insert transfers only to non-coding regions, need the annotation files (in gbk format) of recipient genomes
      -a          folder holds the annotation files (in gbk format) of recipient genomes
      -l          minimum length of intergenic region to be considered for insertion
      -blastn     path to blastn executable, default: blastn
      -blastp     path to blastp executable, default: blastp

Input files and arguments

  1. Sequences of genes to be transferred (in multi-fasta format).

  2. A folder holding all recipient genomes, one file per genome.

  3. The mutation level of genes to be transferred. This can be specified either as a fixed value or as a range (the 'mixed' mode). If the 'mixed' argument is provided, HgtSIM will randomly select a value between the user-specified minimum and maximum mutation levels for each gene transfer.

     # with fixed mutation level (e.g. 10%).
     HgtSIM -t genes.fasta -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -i 10
    
     # with 'mixed' mode (e.g. 5-25%)
     HgtSIM -t genes.fasta -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -mixed 5-25
    
  4. The ratio of mutation categories (separated by dashes). The default setting is '1-0-1-1'. Please refer to the publication (http://dx.doi.org/10.7717/peerj.4015) or the figure below for how to set it.

    [figure: ratio_selection]

  5. The distribution of transfers to the recipient genomes. The first column gives the recipient genome (without file extension), followed by a comma-separated list of the genes to be transferred into it.

     BAD,AAM_03063,AKV_01007,AMAC_01196,AMAU_02632,AMS_01785
     BDS,AAM_00175,AKV_00943,AMAC_00215,AMAU_02085,AMS_01465
     BGC,AAM_00176,AKV_01272,AMAC_01576,AMAU_00617,AMS_02653
     BHS,AAM_00195,AKV_01273,AMAC_01674,AMAU_05963,AMS_03303
     BNM,AAM_00209,AKV_00282,AMAC_02914,AMAU_02414,AMS_03378
     BRT,AAM_00308,AKV_02353,AMAC_03303,AMAU_00830,AMS_01655
    
  6. The flanking sequences to be added to the ends of gene transfers. They can be specified with '-lf' and '-rf'; the default value is None.

     # introduce gene transfers without adding flanking sequences
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna
    
     # or, add the same pair of flanking sequences (e.g. 'TAGATGAGTGATTAGTTAGTTA') to all gene transfers
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -lf TAGATGAGTGATTAGTTAGTTA -rf TAGATGAGTGATTAGTTAGTTA
    
     # or, add flanking sequences dynamically to the two ends of each gene transfer
     HgtSIM -t genes.fasta -i 10 -d distribution.txt -f input_genomes -r 1-0-1-1 -x fna -lf lf.fasta -rf rf.fasta
    

    If you want to add flanking sequences dynamically to the gene transfers, you can specify the left and right side sequences in two multi-fasta files. The IDs of the flanking sequences need to be exactly the same as those of their corresponding gene transfers.

    As an illustration, suppose you have four transfers (transfer_A, transfer_B, transfer_C and transfer_D) and have provided the following two files:

    lf.fasta

     >transfer_A
     AAAAAAAAAA
     >transfer_B
     TTT
    

    rf.fasta

     >transfer_A
     GGGGGGG
     >transfer_C
     CCCCC
    

    HgtSIM will then:

    1. add 'AAAAAAAAAA' to the left and 'GGGGGGG' to the right end of transfer_A;
    2. add 'TTT' to the left and nothing to the right end of transfer_B;
    3. add nothing to the left and 'CCCCC' to the right end of transfer_C;
    4. add nothing to either end of transfer_D.
  7. Transfers can be restricted to intergenic regions by specifying the 'keep_cds' option. The annotation files (in GenBank format) of the recipient genomes are needed to enable this option.
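
The ID-matching rule for dynamic flanking sequences (item 6) can be sketched in a few lines of Python. This illustrates the behaviour described above and is not HgtSIM's actual implementation: the helper names `parse_fasta` and `add_flanks` are hypothetical, and "ATG" is only a placeholder for a real gene sequence.

```python
def parse_fasta(text):
    """Parse multi-fasta text into a dict of {sequence ID: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].strip()
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def add_flanks(transfer_id, transfer_seq, lf, rf):
    """Attach the flanks whose IDs match transfer_id; a missing ID adds nothing."""
    return lf.get(transfer_id, "") + transfer_seq + rf.get(transfer_id, "")


# The lf.fasta and rf.fasta contents from the example above.
lf = parse_fasta(">transfer_A\nAAAAAAAAAA\n>transfer_B\nTTT\n")
rf = parse_fasta(">transfer_A\nGGGGGGG\n>transfer_C\nCCCCC\n")

for transfer_id in ("transfer_A", "transfer_B", "transfer_C", "transfer_D"):
    # "ATG" stands in for the real gene sequence (placeholder).
    print(transfer_id, add_flanks(transfer_id, "ATG", lf, rf))
```

Because the lookup is by exact ID, a flanking record whose ID differs from its transfer by even one character is silently ignored, which is why the IDs must match exactly.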

Output files

  1. The genomes with transferred genes inserted, placed in the folder 'Genomes_with_transfers'.
  2. The amino acid sequences of input genes to be transferred.
  3. The nucleotide and amino acid sequences of mutated input genes.
  4. The mutation report file, which includes two parts:
    1. at the top, the nucleotide (nc) and amino acid (aa) identities between the input and mutated sequences for each transfer;
    2. followed by a summary of the changed nucleotide bases for each transfer.
  5. The insertion report file.

hgtsim's People

Contributors

songweizhi


hgtsim's Issues

KeyError: 'TAG'

Hello developers,
I'm trying to simulate some HGT by transferring several STX elements into an Ecoli_K12 genome, but I have encountered the runtime error below:

2018-12-11 17:18:22 Mixed mode set to False
2018-12-11 17:18:22 Running random mutation
Traceback (most recent call last):
  File "HgtSIM.py", line 434, in <module>
    codon_to_aa_dict[current_codon_for_samesense_mutation],
KeyError: 'TAG'

I guess that there is some error in my sequences_of_transfers.fasta, but I don't know exactly what it is. Could you please help me?

IndexError: list index out of range

Hello,

I am receiving this error when trying to run your program. I am trying to simulate an HGT event between the bacteria Wolbachia and its host Drosophila. My command line is this:

$ python HgtSIM.py -t Wolbachia.fasta -i 0 -d distribution_of_transfers.txt -f recipient_genome -x fasta
My recipient_genome folder contains a Drosophila multi-fasta sequence called Drosophila_test_genome (only part of the genome, not the entire thing). My distribution_of_transfers.txt is a text file with the line "Drosophila_test_genome, Wolbachia". [I eventually want to use the entire Drosophila genome as a recipient genome along with the keep_cds flag, but thought to try out something simpler at first.]

Here is the error that I get:

Running random mutation...
Processing: Wolbachia
Mutants of input sequences exported to: /Users/mjw349/Desktop/HgtSIM-master/test/outputs_0_1-0-1-1/input_sequence_mutant_nc.fasta
Get nucleotide identity (by BlastN) between input sequences and their mutants
Get amino acids identity (by BlastP) between input sequences and their mutants
Random mutation report exported to: /Users/mjw349/Desktop/HgtSIM-master/test/outputs_0_1-0-1-1/Step_1_mutation_report.txt
dynamic_flanking_seqs: False
Running random insertion...
Traceback (most recent call last):
  File "../HgtSIM.py", line 667, in <module>
    new_seq, for_report = get_random_insertion(keep_cds, recipient_ctg_nc, insert_sequence_seq_list_new, insert_sequence_id_list_new, dynamic_flanking_seqs, lf_dict, rf_dict, common_stop_sequence_l, common_stop_sequence_r, recipient_genome_id, recipient_ctg_id, recipient_ctg_intergenic_list)
  File "../HgtSIM.py", line 136, in get_random_insertion
    first_sequence = [1, random_intergenics_sorted[n]]
IndexError: list index out of range

When I use the example given with the program, the example runs fine. So I believe my files are not set up properly.

Any insight would be appreciated.
Many thanks!
Sincerely,
Miwa
