fasta2genotype

This program takes a fasta file listing all sequence haplotypes of all individuals at all loci, as well as a list of individuals/populations and a whitelist of loci, then outputs data in one of eight formats:

(1) migrate-n, (2) Arlequin, (3) DIYabc, (4) LFMM, (5) Phylip, (6) G-Phocs, or (7) Treemix. (8) Additionally, the data can be coded as unique sequence integers (haplotypes) in Structure, Genepop, SamBada, Bayescan, Arlequin, or GenAlEx format, or summarized as allele frequencies by population.
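To illustrate what coding sequences as unique integers means, the sketch below gives each distinct haplotype at a locus an integer in order of first appearance. It is not taken from fasta2genotype.py: the function name `code_haplotypes` and the input layout are assumptions made for this example, and the script's own numbering scheme may differ.

```python
def code_haplotypes(sequences):
    """Map each distinct sequence to an integer (1, 2, 3, ...).

    `sequences` is a list of haplotype strings observed at ONE locus
    (hypothetical input layout). Returns the integer codes in input
    order plus the sequence-to-integer lookup table.
    """
    lookup = {}
    codes = []
    for seq in sequences:
        key = seq.upper()  # treat 'acgt' and 'ACGT' as the same haplotype
        if key not in lookup:
            lookup[key] = len(lookup) + 1  # next unused integer
        codes.append(lookup[key])
    return codes, lookup


# Hypothetical haplotypes: two diploid individuals at one locus.
observed = ["ACGT", "ACGA", "ACGT", "TCGA"]
codes, lookup = code_haplotypes(observed)
print(codes)  # [1, 2, 1, 3]
```

Identical sequences share a code, so genotype-based formats only need the integers, not the sequences themselves.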

Execute program in the following way:

python fasta2genotype.py [fasta file] [whitelist file] [population file] [VCF file] [output name]

Python Version:

NOTE: This script was written for Python 2.7, which was retired on Jan. 1, 2020. You can still run the script with Python 2.7, but note that only Python 3 is maintained at this point.

Quality filtering options:

  • In addition to coverage filtering, several other quality control measures can be selected.
  • Monomorphic loci can be removed.
  • Loci suspected of being paralogs (assembled as one ‘Frankenstein’ locus, but including alleles from different loci) can be removed. Loci surpassing the desired threshold value of heterozygosity are tested for Hardy-Weinberg genotype proportions. If they fail (based on a chi-squared test), and have higher heterozygosity than expected under Hardy-Weinberg, they are removed.
  • Alleles under a given locus-wide frequency can be removed.
  • Alleles present in fewer than a given fraction of populations can be removed (e.g. 4 pops of 16, frequency of 0.25).
  • Loci can be removed from populations where those populations fall under a missing data threshold for those loci
  • Loci under a given locus-wide missing data threshold can be removed.
  • Individuals missing a threshold frequency of loci can be removed.
  • Restriction enzyme sites can be removed if requested, for single- or double-digest setups. Simply select this option and provide the 5' and/or 3' sequence(s). There can be multiple sequences for either the 5' or 3' end. For example, you might have double-digest RAD data with READ 1 and READ 2 in the same fasta file. You can tell the program to remove either ‘TGCAGG’ or ‘CGG’ depending on which is found on the 5' end of a given sequence. Regardless of restriction enzymes or adapters used, it might help to examine the fasta file before choosing this option. Your sequences might be reversed or reverse-complemented.
  • If you are exporting a Phylip alignment file, there are several options:
  1. SNPs or sequences: concatenate just the polymorphic sites (SNPs), or full sequences, including invariable sites.
  2. The alignment can be summarized at the level of haplotypes, individuals, or populations, using IUPAC ambiguity codes.
  3. You can select a subset of loci that are “phylogenetically informative” (PI), meaning fixed for alternate alleles at 2+ taxa, or simply “fixed” (at 1+ taxa).
  4. If you choose the option for “PI”/“fixed” loci, you will probably want to output SNPs rather than full sequences, since you probably want only PI or fixed sites. However, if you choose “PI”/“fixed” loci and “full sequences,” then the program will output complete sequences containing at least 1 PI or fixed SNP.
  5. The alignment can be separated by locus with “!” symbols.
  6. A header of tab-delimited locus names can be added.
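
The paralog filter described in the list above can be sketched for the simplest case, a locus with two alleles. This is an illustration under stated assumptions, not the script's code: the function name, the default heterozygosity threshold of 0.5, and the 0.05 significance level (chi-squared critical value 3.841 at 1 degree of freedom) are all chosen for this example, and real haplotype data often have more than two alleles per locus.

```python
def flag_possible_paralog(n_AA, n_Aa, n_aa, het_threshold=0.5, chi2_critical=3.841):
    """Return True if a biallelic locus looks like a collapsed paralog.

    Inputs are genotype counts. A locus is flagged only when all hold:
      1. observed heterozygosity exceeds `het_threshold` (assumed default),
      2. genotype counts deviate from Hardy-Weinberg proportions
         (chi-squared statistic above `chi2_critical`, 1 d.f.),
      3. observed heterozygosity is higher than expected under HWE.
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        return False
    obs_het = n_Aa / float(n)
    if obs_het <= het_threshold:
        return False  # below the threshold, so the locus is not tested

    p = (2 * n_AA + n_Aa) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    observed = [n_AA, n_Aa, n_aa]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    exp_het = 2 * p * q
    return chi2 > chi2_critical and obs_het > exp_het


# Hypothetical locus where every individual is heterozygous.
print(flag_possible_paralog(0, 20, 0))  # True
```

A locus where all 20 individuals are heterozygous is flagged, while counts of 5/10/5 match Hardy-Weinberg proportions and are kept.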

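Restriction site removal, as described in the filtering options, amounts to stripping a known prefix and/or suffix from each sequence. The sketch below shows one way to do this with several candidate sites per end. The function name and arguments are hypothetical, and the script's own matching rules may differ. Only the example sites ‘TGCAGG’ and ‘CGG’ come from the text.

```python
def trim_restriction_sites(seq, five_prime=(), three_prime=()):
    """Remove at most one matching site from each end of `seq`.

    `five_prime` and `three_prime` are lists of candidate site sequences
    (hypothetical argument names). Sequences are compared in upper case.
    """
    seq = seq.upper()
    # Try longer sites first so a short site that is a prefix of a
    # longer one does not win by accident.
    for site in sorted(five_prime, key=len, reverse=True):
        if site and seq.startswith(site.upper()):
            seq = seq[len(site):]
            break
    for site in sorted(three_prime, key=len, reverse=True):
        if site and seq.endswith(site.upper()):
            seq = seq[:-len(site)]
            break
    return seq


# Hypothetical READ 1 and READ 2 sequences from a double-digest data set.
sites = ["TGCAGG", "CGG"]
read1 = trim_restriction_sites("TGCAGGACGTACGT", five_prime=sites)
read2 = trim_restriction_sites("CGGTTAACC", five_prime=sites)
print(read1)  # ACGTACGT
print(read2)  # TTAACC
```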
Citation:

Please cite in the following way:

Maier P.A., Vandergast A.G., Ostoja S.M., Aguilar A., Bohonak A.J. (2019). Pleistocene glacial cycles drove lineage diversification and fusion in the Yosemite toad (Anaxyrus canorus). Evolution, in press. https://www.doi.org/10.1111/evo.13868
