CustomAssembly

This program is a Python script used to reorganize and filter contigs in a Flye DNA assembly.

Overview

This tool takes a FASTA file containing a Flye assembly as input and uses a configuration file to process and filter the contigs into three output files: a main output file, a second output file containing the specified microbial contigs, and a third output file containing the other specified contigs to be removed from the assembly.

Usage

Input and output FASTA file formats

In the FASTA input of CustomAssembly, sequence names must be in the format "contig_{sequence name}" or "scaffold_{sequence name}" to be processed properly. In the FASTA outputs of CustomAssembly, every sequence is named "contig_{sequence name}".
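The naming rule above can be sketched as a small check. This is only an illustration under assumptions: the helper name `normalize_name` and the regular expression are hypothetical and are not taken from CustomAssembly.py, which may validate names differently.

```python
import re

# Hypothetical pattern inferred from the naming rule above; not taken from
# CustomAssembly.py.
NAME_PATTERN = re.compile(r"^(?:contig|scaffold)_(.+)$")


def normalize_name(header: str) -> str:
    """Return the 'contig_<sequence name>' form of a FASTA header line."""
    # Drop the leading '>' and ignore any description after the first space.
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError("empty FASTA header")
    match = NAME_PATTERN.match(fields[0])
    if match is None:
        raise ValueError(f"unsupported sequence name: {fields[0]!r}")
    return f"contig_{match.group(1)}"
```

Under these assumptions, a header such as ">scaffold_12" would be written out as "contig_12", and a header that matches neither prefix is rejected.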

Running the program

Usage of CustomAssembly is as follows:

python CustomAssembly.py [-h] -i <input> -o <output> -c <config> -m <microbial> -d <deleted>
	
	<input>
		input file in FASTA format (with sequence naming format of "contig_<sequence name>") 
		containing sequences that will be processed
	<output>
		output file in FASTA format containing processed sequences,
		excluding sequences that were specified to be filtered out
	<config>
		input file in .config configuration file format (specified below)
	<microbial>
		output file in FASTA format containing microbial sequences 
		that were specified to be filtered out
	<deleted>
		output file in FASTA format containing other (non-microbial) sequences
		that were specified to be filtered out
options:
	-h, --help
		prints out the above usage statement

Even if you do not specify any microbial or other sequences to be removed, you must still provide file paths for all output files (that is, include the -m and -d arguments) for CustomAssembly to run properly.
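A command-line interface with these five required arguments could be declared with the standard library's argparse module, as in the minimal sketch below. The `dest` names and help strings are assumptions for illustration, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage statement (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="CustomAssembly.py",
        description="Reorganize and filter contigs in a Flye assembly.",
    )
    # All five arguments are required, including -m and -d.
    parser.add_argument("-i", dest="input", required=True,
                        metavar="<input>", help="input FASTA file")
    parser.add_argument("-o", dest="output", required=True,
                        metavar="<output>", help="main output FASTA file")
    parser.add_argument("-c", dest="config", required=True,
                        metavar="<config>", help="configuration file")
    parser.add_argument("-m", dest="microbial", required=True,
                        metavar="<microbial>", help="microbial output FASTA file")
    parser.add_argument("-d", dest="deleted", required=True,
                        metavar="<deleted>", help="deleted output FASTA file")
    return parser
```

Because every argument is marked as required, argparse exits with a usage error when -m or -d is omitted, which matches the requirement stated above. The -h/--help option is added by argparse automatically.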

Configuration file format

The configuration file should be a .config file in which each line specifies one of four possible operations on the assembly and the sequences that the operation applies to.

The operations are as follows:

d: delete - all specified contigs are removed from the main output file and written to the <deleted> file.
    format: d:<contig name>,<contig name>,...
i: invert - all specified contigs are inverted (each sequence is reversed and its bases are complemented, i.e. reverse-complemented)
    format: i:<contig name>,<contig name>,...
c: combine - all specified contigs are combined, in the given order, into one large sequence named <new_contig_name>; appending * to a <contig name> inverts that specific contig
    format: c:<new_contig_name>:<contig name>,<contig name>*,...
m: microbial - all specified contigs are removed from the main output file and written to the <microbial> file.
    format: m:<contig name>,<contig name>,...
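The invert and combine operations can be illustrated with a short sketch. It assumes that "invert" means taking the reverse complement and that combined contigs are joined directly with no spacer between them. Both are assumptions rather than confirmed behavior of CustomAssembly.py, and the function names are hypothetical.

```python
# Complement table for the four standard bases in both cases; any other
# character (for example N) is passed through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def invert(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def combine(contigs: dict, members: list) -> str:
    """Concatenate contigs in the given order, inverting names ending in '*'.

    Assumption: sequences are joined directly, with no gap inserted.
    """
    parts = []
    for member in members:
        if member.endswith("*"):
            # Strip the '*' marker and invert this contig before joining.
            parts.append(invert(contigs[member[:-1]]))
        else:
            parts.append(contigs[member])
    return "".join(parts)
```

For example, with the hypothetical contigs `{"3": "AAC", "10": "GGT"}`, calling `combine(contigs, ["3", "10*"])` joins "AAC" with the reverse complement of "GGT".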

An example configuration file would be:

d:1,5,8,15
i:4,6
c:combined:3,10*,18*,22
m:19,20

The configuration file does not need to include every operation, and the same operation can appear on multiple lines (for example, you can define several combined sequences, or list the sequences to be deleted over multiple lines).
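A parser for this line format could look like the sketch below. It is an illustration only: the function name `parse_config` and the returned dictionary layout are hypothetical, and the actual script may handle blank lines, whitespace, and errors differently.

```python
def parse_config(lines):
    """Parse configuration lines into per-operation contig lists.

    Returns a dict with lists for 'd', 'i' and 'm', and for 'c' a dict that
    maps each new contig name to its ordered member list ('*' markers kept).
    """
    operations = {"d": [], "i": [], "m": [], "c": {}}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        op, _, rest = line.partition(":")
        if op == "c":
            # c:<new_contig_name>:<contig name>,<contig name>*,...
            new_name, _, members = rest.partition(":")
            operations["c"][new_name] = [
                name.strip() for name in members.split(",") if name.strip()
            ]
        elif op in ("d", "i", "m"):
            # Repeated lines for the same operation accumulate.
            operations[op].extend(
                name.strip() for name in rest.split(",") if name.strip()
            )
        else:
            raise ValueError(f"unknown operation in line: {line!r}")
    return operations
```

Applied to the example configuration file above, this sketch would collect contigs 1, 5, 8 and 15 under "d" and map "combined" to the ordered members 3, 10*, 18* and 22.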

Disclaimer

This tool was primarily used internally to process D. melanogaster Flye assemblies hundreds of megabases in size, and thus has only been tested with FASTA inputs of 130-150 MB. Feel free to modify or build upon this code to fit your specific usage and needs.

License

This program is licensed under the Apache License, Version 2.0. A copy of the Apache 2.0 license can be found here.
