scevt's Introduction

SCEVT is a tool to easily visualize and analyze scaffolds during de-novo genome assembly.

SCEVT consists of two scripts:

  • scaal.py
  • scaphy.py

scaphy.py (Scaffold to Physical Reference Mapping)

scaphy is a tool to visualize scaffolds in relation to a reference genome assembly. Specifically, it draws gaps within the scaffolds (especially helpful for BioNano-assisted scaffolds) and draws mappings to a reference chromosome wherever the genes match. It also highlights any gene on the scaffold that is not on the specified chromosome of the reference genome (meaning you have probably anchored a new contig).
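
As a rough illustration of the gap-detection step, the sketch below locates runs of N in a scaffold sequence. It assumes gaps are encoded as N characters, which is common for scaffolded assemblies. The function name `find_gaps` and the `min_length` parameter are hypothetical and are not taken from scaphy.py.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end) pairs, 0-based and end-exclusive, for each run of N.

    Hypothetical helper: scaphy.py may detect gaps differently.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


scaffold = "ACGTNNNNNACGTACGTNNACGT"
print(find_gaps(scaffold))                # [(4, 9), (17, 19)]
print(find_gaps(scaffold, min_length=3))  # [(4, 9)]
```

Each returned interval could then be drawn as a shaded block along the scaffold.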

Here is an example output: [image: example scaphy output]

How to use

scaco.py (Scaffold Comparison)

scaco directly compares two scaffolds based on gene annotations. It highlights and maps the genes that the two scaffolds share, and highlights the genes that are present on one scaffold but not the other. It also plots the gaps within the scaffolds. This is useful for comparing haplotype contigs of a de-novo assembly.
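
The core of such a comparison can be sketched with set operations on gene IDs. This is only an illustration under the assumption that each scaffold is represented by a list of gene identifiers. The function name `compare_scaffolds` and the example IDs are hypothetical and do not come from the SCEVT scripts.

```python
def compare_scaffolds(genes_a, genes_b):
    """Split gene IDs into shared genes and genes unique to either scaffold."""
    set_a, set_b = set(genes_a), set(genes_b)
    return {
        "shared": sorted(set_a & set_b),  # candidates for mapping lines
        "only_a": sorted(set_a - set_b),  # highlighted on scaffold A only
        "only_b": sorted(set_b - set_a),  # highlighted on scaffold B only
    }


result = compare_scaffolds(["g1", "g2", "g3"], ["g2", "g3", "g4"])
print(result["shared"], result["only_a"], result["only_b"])
```

Shared genes would be connected between the two scaffolds in the plot, while the unique ones would be marked on their own scaffold.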

Here is an example output: [image: example scaco output]

How to use

Installation

Getting the Files

# Go to where you want to have this tool
cd path/to/Project/directory
git clone https://github.com/pbieberstein/SCEVT.git SCEVT

Installing Python & Dependencies

This tool was developed in Python 2.7.

The easiest way: install Anaconda for Python 2.7 on your local machine and then install the dependencies via:

conda install biopython matplotlib==1.5.3 pandas
conda install --channel bioconda gffutils
# gmap for creating the feature mapping (output needs to be set to BLAT)
conda install -c bioconda gmap

Alternatively, if you want to stay organized, we recommend that you install Miniconda and then create a new virtual environment with the dependencies for this project. (https://conda.io/docs/install/quick.html) (Additional conda help: https://conda.io/docs/_downloads/conda-cheatsheet.pdf)

cd path/to/Project/directory
conda create --prefix ./scevt-env biopython matplotlib==1.5.3 pandas
# This creates a new environment with biopython, matplotlib and pandas installed inside the folder "scevt-env"

Note: It's important to use matplotlib 1.5.3; otherwise SCEVT will run very slowly.

Now when you want to run SCEVT, you'll first have to activate this new python environment via:

source activate ./scevt-env

Now open a new terminal window so that the updated PATHs take effect, and you're ready to run scaal and scaphy.

Then you can run the tools via:

cd path/to/Project/directory/SCEVT
cd Scripts
python scaal.py
# or
python scaphy.py

Progress:

  • scaal.py script is DONE
  • scaphy.py script is DONE
  • Documentation is DONE

This tool was written to assist in a de-novo genome assembly project at ETH Zurich.

It is not actively maintained, but it should still be useful. If you have any questions, ideas, or concerns, contact me.

scevt's People

Contributors

pbieberstein


scevt's Issues

Gff parser

Use a better GFF parser so that GFF documents are read more robustly. Currently the header has to be removed beforehand :/
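
One way to tolerate header lines without editing the file first is to skip comment and directive lines while reading. The sketch below is a minimal illustration, not the parser used by SCEVT. It assumes GFF3-style input in which comment lines start with `#` and data lines have nine tab-separated columns. The function name `iter_gff_records` is hypothetical.

```python
def iter_gff_records(lines):
    """Yield the nine columns of each GFF data line, skipping the header."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence data follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # blank line, comment, or directive such as ##gff-version
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        yield fields


example = [
    "##gff-version 3\n",
    "# produced by some annotation pipeline\n",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=geneA\n",
]
for record in iter_gff_records(example):
    print(record[0], record[2], record[3], record[4])
```

With a real file, the same generator can be fed an open file handle instead of a list.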

create example files

Create example files that can be used immediately, just to check that the software is working.

TypeError: 'NoneType' object is not iterable

When using a different genome, PSL file, and GFF file, the following error occurs:

Traceback (most recent call last):
  File "/Users/spascal/PycharmProjects/TestProject2/Philipp/SCEVT/Scripts/scaphy.py", line 605, in <module>
    scaff_dict[key].get_reference_mappings(reference_gff_file)
  File "/Users/spascal/PycharmProjects/TestProject2/Philipp/SCEVT/Scripts/scaphy.py", line 457, in get_reference_mappings
    ref_genome_loc = get_gene_locations_from_gff_db(reference_gff_file) # puts all reference genome locations in one big dictionary... hopefully fast to search through :/
  File "/Users/spascal/PycharmProjects/TestProject2/Philipp/SCEVT/Scripts/scaphy.py", line 542, in get_gene_locations_from_gff_db
    db = gffutils.FeatureDB(db_name)
  File "/Users/spascal/PycharmProjects/TestProject2/venv/lib/python2.7/site-packages/gffutils/interface.py", line 132, in __init__
    version, dialect = c.fetchone()
TypeError: 'NoneType' object is not iterable

I installed Python 2.7.13 on macOS, then installed Anaconda, matplotlib (which needs an additional file to be set), pandas, biopython, and gffutils.
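
The traceback shows `c.fetchone()` returning `None` while gffutils reads a version and dialect, which suggests that the database file exists but holds no metadata row, for example because its creation was interrupted. The following is a minimal diagnostic sketch under that reading. It assumes the metadata lives in a SQLite table named `meta` with `version` and `dialect` columns. That is an assumption about gffutils internals and should be checked against the installed version. The helper name is hypothetical.

```python
import os
import sqlite3


def looks_like_populated_gff_db(db_path):
    """Return True only if db_path is a SQLite file with a metadata row.

    Assumption: the metadata is stored in a table called `meta`.
    """
    if not os.path.isfile(db_path):
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("SELECT version, dialect FROM meta").fetchone()
    except sqlite3.DatabaseError:
        # Missing table, or the file is not a SQLite database at all
        return False
    finally:
        conn.close()
    return row is not None
```

If the check fails, deleting the stale database file and rebuilding it from the GFF file is a reasonable first thing to try.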

genome_sequence folder name is confusing

Make sure the documentation makes clear that reference genome sequences are not needed. The tool only needs the FASTA files of the newly assembled sequences and the PSL file, which provides the coordinates on the reference genome.
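
To make the role of the PSL file concrete, here is a small sketch that pulls names and coordinates out of one PSL alignment line. It assumes the standard 21-column, tab-separated PSL layout produced by BLAT-compatible output. Which side (query or target) corresponds to the reference depends on how the mapping was run, so the field names below are deliberately neutral. `PslHit` and `parse_psl_line` are hypothetical names, not part of SCEVT.

```python
from collections import namedtuple

PslHit = namedtuple("PslHit", "query strand target target_start target_end")


def parse_psl_line(line):
    """Parse one PSL data line; return None for header or malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 21 or not fields[0].isdigit():
        return None
    return PslHit(
        query=fields[9],               # qName
        strand=fields[8],              # strand
        target=fields[13],             # tName
        target_start=int(fields[15]),  # tStart
        target_end=int(fields[16]),    # tEnd
    )
```

Header lines such as `psLayout version 3` do not have 21 numeric-leading columns, so they are skipped by returning `None`.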

Assumption of Gene to transcript ID conversion

In your code you assume that CDS IDs are "connected" to gene IDs via Geneid.1, Geneid.2, and so on.
In many organisms this is not the case. Thus your line:
pattern = re.compile('.*(?=.)')
won't work.
Some genomes require a non-trivial conversion from gene ID to transcript ID to protein ID.
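
A possible alternative to deriving the gene ID from the transcript name is to read the relationship from the annotation itself. The sketch below assumes a GFF3 file in which transcript features carry `ID` and `Parent` attributes in the ninth column. It is an illustration of the idea, not code from SCEVT, and the function names are hypothetical.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_to_gene(gff_lines):
    """Map each transcript ID to its parent gene ID using GFF3 attributes."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("mRNA", "transcript"):
            continue
        attrs = parse_attributes(fields[8])
        if "ID" in attrs and "Parent" in attrs:
            mapping[attrs["ID"]] = attrs["Parent"]
    return mapping
```

A lookup table built this way works regardless of the naming scheme, although annotations that list several comma-separated parents or use other feature types would need extra handling.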
