Code Monkey home page Code Monkey logo

mmgenome2's Introduction

R-CMD-check

Tools for extracting individual genomes from metagenomes

mmgenome2 is an R-package designed to facilitate reproducible extraction of individual genomes from metagenomes. It is an implementation of the different binning strategies described in the multi-metagenome project, and makes it possible to apply these to any metagenome data. In combination with the RMarkdown format, the mmgenome2 package allows for reproducible step-by-step extraction of high-quality genomes from metagenomes as well as generating publication-ready figures with minimal effort.

Installation

First, install R (3.4.3 or later) and RStudio. Windows users should also install RTools, make sure the PATH checkbox is checked during installation. Then open RStudio as administrator (!) and run the following commands (just copy/paste) to install mmgenome2 from the console:

#check for remotes
if(!require(remotes))
  install.packages("remotes")
  
#install mmgenome2 using remotes
remotes::install_github("kasperskytte/mmgenome2")

Installation on MAC

To install mmgenome2 on MAC please see this before running the above commands.

Get started

For a brief guide about the basics of mmgenome2 go to the Get Started page.

mmgenome2's People

Contributors

kasperskytte avatar kirk3gaard avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

mmgenome2's Issues

mmplot - selection

I am having troubles getting back the coordinates when doing the "selector" argument with mmplot. I only have one read set of my sample, so I am plotting the GC value vs Coverage:

mmplot(mm, x=mm$cov_testfile.fasta, y=mm$gc, x_scale = "log10", y_limits = c(50,75), x_limits = c(10,10000), locator = TRUE)

When running this line, I can draw my polygon, but after clicking Finish, only the following error message pops up:

Error in names(x) <- value : 
  'names' attribute [54343] must be the same length as the vector [2]

mmplot to support label colors

It would be awesome to be able to add labels with different colours and if possible more than one label for each contig. This would support that I could label contig with a 16S rRNA gene with their classification in one colour and label contig with lets say a mcrA gene in another colour. A contig with both an mcrA and 16S rRNA gene would thus have two separate labels in different colours.

Umap instead of tsne as non-linear dimension reduction

Dear mmgenome2 team,

I recently noticed that t-sne does not preserve the global structure of the original data. So it is possible that for some genome species, we will lose the global information. UMAP was designed to preserve the global structure. The R implementation (uwot) is pretty fast compare to Rtsne-multicore, which is a plus to UMAP. Does it worth some efforts to also add UMAP support?

Thanks,

Jianshu

mmplot_cov_profiles to include scaffold labels

Hi

It would be convenient if scaffold names are somehow passed to the points in mmplot_cov_profiles so that it is possible to investigate which scaffolds belong together (using hovering with plotly), without having to make a second plot.

Best regards
Rasmus

mmplot_cov_profiles

Normalise option should maybe accept a second mm object. If the point is to get a fraction of the entire community it is important that the full assembly is provided. This would also mitigate that coverage correlates simply due to differences in sequencing depth across samples.

Can you put a link in the home page about where the data for the example comes from?

Hi there,
I'm hoping to use this mmgenome2 to teach the basics of metagenomics to undergrads in an upcoming workshop, and I wondered if you would be able to tell me where the data for the mmgenome2 data object came from? I'm also keen to get hold of the short read data that generated the dataset, so I can walk them through how to make the coverage calculations (but I'm guessing I will be able to grab that once I know which paper it was from).

Great tool!

Best wishes,
Ben

install on mac os, R 3.6.1

done as the website https://kasperskytte.github.io/mmgenome2/articles/MACinstall.html

  1. while an error at the 4th step
    run: brew install gcc --without-multilib
    --without-multilib would raise an error and i delete it
    the gcc installed is version 9, so I change 7 with 9 in the file Makevars

  2. Then I open R with sudo open -a R, do as the guide:

>remotes::install_github("kasperskytte/mmgenome2")
Downloading GitHub repo kasperskytte/mmgenome2@master
Skipping 6 packages ahead of CRAN: Biostrings, BiocGenerics, S4Vectors, IRanges, XVector, zlibbioc
Running R CMD build...

  • checking for file ‘/private/var/folders/q_/3k0tthbx72jdw_7jy701dtl00000gn/T/Rtmpb5HQ0i/remotes256a7b8c0a21/KasperSkytte-mmgenome2-7b48038/DESCRIPTION’ ... OK
  • preparing ‘mmgenome2’:
  • checking DESCRIPTION meta-information ... OK
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
  • looking to see if a ‘data/datalist’ file should be added
  • building ‘mmgenome2_2.1.tar.gz’
  • installing source package ‘mmgenome2’ ...
    ** using staged installation
    ** R
    ** data
    *** moving datasets to lazyload DB
    ** byte-compile and prepare package for lazy loading
    Note: wrong number of arguments to 'sqrt'
    Note: wrong number of arguments to 'floor'
    ** help
    *** installing help indices
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded from temporary location
    ** testing if installed package can be loaded from final location
    ** testing if installed package keeps a record of temporary installation path
  • DONE (mmgenome2)

>library(mmgenome2)
Error: package or namespace load failed for ‘mmgenome2’:
.onAttach failed in attachNamespace() for 'mmgenome2', details:
call: fun(libname, pkgname)
error: object 'remote_version' not found
In addition: Warning message:
In file(con, "r") :
URL 'https://raw.githubusercontent.com/kasperskytte/mmgenome2/master/DESCRIPTION': status was 'Couldn't connect to server'

while the installation is done, but i can't library it.
could anyone give some help?
Thanks.

mmextract does not work with integers

When plotting something that has integer values e.g. length the selection tool breaks when using mmextract. e.g.

sel<-data.frame(length = c(3186796.968, 3139304.135, -53787.213, -42302.204),
           cov_I_01 = c(198.8, 279.547, 280.235, 223.964))

mmplot(mm = d_01,locator = F,selection = sel,
       x = "length",
       y = "cov_I_01"
       ) 
d_01_subset<-mmextract(mm = d_01,selection = sel)

Gives the error:

Error in sp::point.in.polygon(point.x = mms[[colnames(selection)[1]]], : REAL() can only be applied to a 'numeric', not a 'integer'

mmnetwork breaks when running on big datasets

When running mmnetwork with big networks I have encountered an error:

mmnetwork(mm = d, network =links) Warning message: In as.list.default(X) : reached elapsed time limit

A plot is generated but not with a network, it could possibly be fixed by allowing the user ti specify a new time limit.

mmnetwork_fail

Installing `mmgenome2` with R 3.5.0

Known installation issues with R 3.5.0

Windows:

  • Currently only data.table creating problems with installation
  • fix: run install.packages("data.table", repos="https://Rdatatable.github.io/data.table") before remotes::install_github("kasperskytte/mmgenome2")

MAC:

  • No problem installing

Linux

  • No problem installing with R 3.5.0 installed from source

mmexport to make linux line ends

It would be useful if it was possible to specify mmexport to make linux line ends as a lot of tools are not happy with the windows line ends and changing them in e.g. notepad++ occasionally results in weird errors.

mmgenome2 installing with R 3.6.0

Hi, I have very new version of the R sotware 3.6.0. I tried to install the mmgenome2 on it. But it seems several of the required packages are not compatible with the latest version. So I have very hard time installing the mmgenome2. Could you pleas help me in this regard! your help is very much appreciated. Thanks in advance.

specify color range

When setting color_by as "gc" it is nice to get a broad selection of colours on the contigs when they vary a lot e.g. 20-70. But when you have a final bin with very similar gc content 58-60 it makes little sense to have the entire colour spectrum as the contigs will look like they have extremely different gc, when they in fact do not.

It would thus be nice to set a range for the values to be coloured or the range could just be 0-100 for gc no matter what unless the user specifies a different range using normal ggplot settings

shiny not supported on M1

Thank you for using mmgenome2!

If you encounter a problem with mmgenome2 that you are unable to resolve, this is the place for discussing the problem with developers to find a solution. To best help you quickly, please first make sure that:

  • you have the latest version of the package installed
  • you have carefully read the documentation of the mmgenome2 function(s) used
  • you have searched the GitHub issues page or the internet for similar or identical problems

Should the problem still persist, then please provide a minimal reproducible example, preferably using reprex, with self-contained code so that anyone can copy the code and reproduce the problem to easily start debugging without further communication.

If you suspect that the data used is involved in causing the trouble, then please upload it (preferably a minimal subset) to figshare or similar services and provide a link to it.

all dependencies looks good except shiny for M1 Mac.

Why not remove this. I do not see any reason to use it except interactive plot.

THanks,

Jianshu

mmplot to support color by correlation to a selected contig

If the user could pick one contig and all contigs were then coloured based on the correlation of their coverage profiles and the coverage profile for the selected contig it would enable the user to benefit from the power of multiple coverage profiles within a single mmplot. This way one could easily distinguish between contigs that are closely together just by chance and contigs that are closely together in multiple coverage dimensions.

New Installing problems with R 3.5.1

Hello
I meet a problem below when i installing ‘mmgenome2' with R3.5.1(Rstudio version 1.1.453 ;win10 system 64bit).I follow your proposal and sucessfully installing 'data.table' package before remotes::install_github("kasperskytte/mmgenome2") ,but the problem can not be solved.
Thank you for your help!

remotes::install_github("kasperskytte/mmgenome2")
Downloading GitHub repo kasperskytte/mmgenome2@master
Downloading GitHub repo jkrijthe/Rtsne@openmp
Installing package into ‘C:/Users/user614/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)

  • installing source package 'Rtsne' ...
    ** libs

*** arch - i386
Warning in system(cmd) : 'make' not found
ERROR: compilation failed for package 'Rtsne'

  • removing 'C:/Users/user614/Documents/R/win-library/3.5/Rtsne'
  • restoring previous 'C:/Users/user614/Documents/R/win-library/3.5/Rtsne'
    In R CMD INSTALL
    Downloading GitHub repo Rdatatable/data.table@master
    Installing package into ‘C:/Users/user614/Documents/R/win-library/3.5’
    (as ‘lib’ is unspecified)
  • installing source package 'data.table' ...
    ** libs

*** arch - i386
Warning in system(cmd) : 'make' not found
ERROR: compilation failed for package 'data.table'

  • removing 'C:/Users/user614/Documents/R/win-library/3.5/data.table'

  • restoring previous 'C:/Users/user614/Documents/R/win-library/3.5/data.table'
    In R CMD INSTALL
    Skipping 1 packages ahead of CRAN: data.table
    Installing 4 packages: crosstalk, hexbin, htmlwidgets, plotly
    package ‘crosstalk’ successfully unpacked and MD5 sums checked
    package ‘hexbin’ successfully unpacked and MD5 sums checked
    package ‘htmlwidgets’ successfully unpacked and MD5 sums checked
    package ‘plotly’ successfully unpacked and MD5 sums checked
    Installing package into ‘C:/Users/user614/Documents/R/win-library/3.5’
    (as ‘lib’ is unspecified)

  • installing source package 'mmgenome2' ...
    ** R
    ** data
    *** moving datasets to lazyload DB
    ** byte-compile and prepare package for lazy loading
    Note: possible error in 'layout(margin = list(l = 50, ': 参数没有用[means Parameter is useless](margin = list(l = 50, r = 0, b = 150, t = 20))
    Note: wrong number of arguments to 'sqrt'
    Note: wrong number of arguments to 'floor'
    ** help
    *** installing help indices
    converting help for package 'mmgenome2'
    finding HTML links ... 好了[means done]
    mmexpand_network html
    mmexport html
    finding level-2 HTML links ... done

    mmextract html
    mmload html
    mmlocator html
    mmmerge html
    mmmerge_linkfiles html
    mmnetwork html
    mmplot html
    mmplot_cov_profiles html
    mmplot_pairs html
    mmscanbins html
    mmstats html
    print.ggplot html
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded
    *** arch - i386
    *** arch - x64

  • DONE (mmgenome2)
    In R CMD INSTALL
    Warning messages:
    1: In missing_devel_warning(pkgdir) :
    Package Rtsne has compiled code, but no suitable compiler(s) were found. Installation will likely fail.
    Install Rtools and make sure it is in the PATH.
    2: In i.p(...) :
    installation of package ‘C:/Users/user614/AppData/Local/Temp/Rtmp4sZBNJ/remotes66586a2b5d2f/jkrijthe-Rtsne-1d0f926’ had non-zero exit status
    3: In missing_devel_warning(pkgdir) :
    Package data.table has compiled code, but no suitable compiler(s) were found. Installation will likely fail.
    Install Rtools and make sure it is in the PATH.
    4: In i.p(...) :
    installation of package ‘C:/Users/user614/AppData/Local/Temp/Rtmp4sZBNJ/remotes665877f348ba/Rdatatable-data.table-09495c1’ had non-zero exit status

Problems in mmgenome2 with own dataset

Initially, the error was "Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") :
cannot open file 'data/assembly.fa", then when the path was provided it gave the following error:

"Error: The assembly contains duplicate sequence names"

Please help at the earliest.

Regards

Running mmstats on big assemblies cause integer overflow

When running mmstats on a big assembly (160000 contigs) the N50 calculatation fails with the following warning message. The stats are still returned but with an NA for the N50.

Warning message:
In which(cumsum(lengths) >= lengthTotal/2) :
integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

extract.fastq.for.reassembly.pl: Use of uninitialized value

After extracting my scaffolds using the mmgenome2 package, I want to do the reassembly as the mmgenome tutorial suggests. I am therefore using the extract.fastq.for.reassembly.pl script.

Although the reads are read in correctly, I am getting an error when the reads are extracted:
Use of uninitialized value $extract in substitution (s///) at extract.fastq.for.reassembly.pl line 135, <infastq_fh> line 41.

I am using an interleaved fastq file, which looks like this:

@J00137:60:H7VKMBBXX:3:1101:1215:1173 1:N:0:CCACAACA
GAGAAACTGCTCAAGCAGCGTCGGATCGTCCAGTTGTGTCTCCTCCCCCACCGGCAACAGAGGTTTGGCCCCCATCAGGCCGCGGGCGAAGATGCCGACGCCTACAGGCTCGGCGGCCTCGCCTTCAGGACCATGACCCTGCCCCTCCCC
+
<AAAFJJFJJFFFFJJFJJJFJ<A-<<FAJJFJJJJFJ<F-FFFFF-FAFJJJ<JJ7F7AJJJFJJA<FJ-7A<7A-<J-A<<FAAFF<-77<FF<A-<<FJ----<-F)A77))-)-)<AJF7)77A-777---7-77))<)<<))-7<
@J00137:60:H7VKMBBXX:3:1101:1215:1173 2:N:0:CCACAACA
GTTCCACGCCTGCCGATTCGATTGGTCGGCGATCCGGAACGGCCGCGCGCCGTCGGCCTGGGCGATACCCCGCTCCCCGTAACGGGAAGGCCCTATTCCTTCCCCCGTCGGCAGGTTTCCTCCCTGCCGGACATCCGCGGCTTGATATCAC
+
AA-<FJJAJFJFJJ-F-AFJ--<J<-FA<F<<-7AF-A7A--JJ7JFJ<AJF-7<AF-7-7<J--7-777-77----77<AA7<---7------)FJF7--A)77A-)-<<JFF)---7--7AF))))))--77-A))-)))))7-----7
@J00137:60:H7VKMBBXX:3:1101:2493:1173 1:N:0:CCACAACA
AACATCGCAATCGTCATCGGTCCGACTCCACCTGGTACTGGCGTGATGTGACTTGTCTTTTTACTTACCTTTTCAAAATCAACATCTCCTGTAATGCGATATCCGCGTTTACGTGTCTCGTCTGGAACTCTTGTGATTCCACCCTCAATT
+
AAFFJJFJJJFJJJJFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJ<JJJFFJJJJJJJJJJJJJJJJJJJJFJJ7A-AJJJJJFJJJJJFJ<AJJFFFJJFF<FJJF<J--AJF<A-A)-77A<<AJ--<F-AF-AFF-<-))7)A-<-
@J00137:60:H7VKMBBXX:3:1101:2493:1173 2:N:0:CCACAACA
GATGAGTATTTTGATGGGAAGAAAAGGATTTCCTGGAAACTCTACAGTAACCTTAACACACAGTCACACCAAAAACATTACGCAAATTACCTGTCAGGCAGACATTATCATCACTGCTTTAGGACAACCTAATGTCTTAAAAGCAGAAATG
+

And my sam file:

J00137:60:H7VKMBBXX:3:1101:6086:31031	0	1	1	60	18S133M	*	0	0	GTCGCTGCTTGGAGCGCTTTGCAGAAAACCTCACGCCCTCGTTGCTCATCGCCCAAGCCGTAGGAGCAGACCAACACTTAACGCAGACATAGCCTAAATAAATCGTCGGATAACCAATGGAAAGCAACAGCATTCAGCTACCCGGCTCCGA	AAFFFJJJJJJJJJJJJJJJJJFJFAFJJJFJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJFJJFJJFJJJJJJJJ-FFJJJJJAJ<JFFFJJJFJFFFJJJJAA<FJJF7AFJFAF<FF7F77<F<F<FJF7<A7AJ<7FFFFJJ---<-	NM:i:0	MD:Z:133	AS:i:133	XS:i:95
J00137:60:H7VKMBBXX:3:1101:14834:39752	0	1	1	43	25S126M	*	0	0	GGGTGGCGTCGCTGCTTGGAGCGCTTTGCAGAAAACCTCACGCCCTCGTTGCTCATCGCCCAAGCCGTAGGAGCAGACCAACACTTAACGCAGACATAGCCTAAATAAATCGTCGGATAACCAATGGCAAGCAACAGCATTCAGCTACCCG	AAFAFFJFJJJJJJJJFJJJJJJJJJJJJJJ-FJJJJ<JFJFJJJJJJJJJJ7JFFJ<AJJFJJJJF7JJJAJJFJFJJ<FJAJFAFJJJJAJ<FJFJJJJ7AJJJFJJFFAFJFFJJJJJJAJAJJ-A777<FFF<FJ7FFJAFFJJ-F<	NM:i:1	MD:Z:102A23	AS:i:121	XS:i:102	XA:Z:9,+54391,102M49S,0;92,-8561,50S101M,0;
J00137:60:H7VKMBBXX:3:1101:25012:45432	0	1	1	60	17S134M	*	0	0	TCGCTGCTTGGAGCGCTTTGCAGAAAACCTCACGCCCTCGTTGCTCATCGCCCAAGCCGTAGGAGCAGACCAACACTTAACGCAGACATAGCCTAAATAAATCGTCGGATAACCAATGGAAAGCAACAGCATTCAGCTACCCGGCTCCGAG	AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJAJJJJJJJJJJFJFJJJJJJJJFFJFJ<JFJFFFFJJF<FFJJJFJ-<AJJJJFJJJJJFFJJJJFAJJFJJJJJJJFJ-FFJFA	NM:i:0	MD:Z:134	AS:i:134	XS:i:94
perl extract.fastq.for.reassembly.pl -s file.sam -r ref.fasta -f reads.fq.gz -o extracted.fq

run mmload without coverage info

It should be possible to run mmload without supplying coverage information.
e.g.
mmload(assembly = "assembly.fa", kmer_BH_tSNE = T)

Failed to install it in R4.0

My system is Windows,and I have installed Rtools3.5 before the installation of mmgenome2.
After the first failure,I installed package Rtsne and backports bacause it told me that the auto installation didn't work.
What I have done doesn't solve the problem.
Now it shows:

remotes::install_github("kasperskytte/mmgenome2")
Downloading GitHub repo kasperskytte/mmgenome2@master
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

1: All
2: CRAN packages only
3: None
4: Rtsne (0.15 -> 1d0f926d8...) [GitHub]
5: backports (1.1.6 -> 1.1.7 ) [CRAN]
6: xfun (0.13 -> 0.14 ) [CRAN]

Enter one or more numbers, or an empty line to skip updates:
1
Rtsne (0.15 -> 1d0f926d8...) [GitHub]
backports (1.1.6 -> 1.1.7 ) [CRAN]
tinytex (NA -> 0.23 ) [CRAN]
xfun (0.13 -> 0.14 ) [CRAN]
modelr (NA -> 0.1.8 ) [CRAN]
Installing 4 packages: backports, tinytex, xfun, modelr
Installing packages into ‘C:/Users/ColdSymmetry/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)

There are binary versions available but the source versions are later:
binary source needs_compilation
backports 1.1.6 1.1.7 TRUE
tinytex 0.22 0.23 FALSE
xfun 0.13 0.14 FALSE
modelr 0.1.7 0.1.8 FALSE

installing the source packages ‘backports’, ‘tinytex’, ‘xfun’, ‘modelr’

trying URL 'https://cran.rstudio.com/src/contrib/backports_1.1.7.tar.gz'
Content type 'application/x-gzip' length 17676 bytes (17 KB)
downloaded 17 KB

trying URL 'https://cran.rstudio.com/src/contrib/tinytex_0.23.tar.gz'
Content type 'application/x-gzip' length 27142 bytes (26 KB)
downloaded 26 KB

trying URL 'https://cran.rstudio.com/src/contrib/xfun_0.14.tar.gz'
Content type 'application/x-gzip' length 69436 bytes (67 KB)
downloaded 67 KB

trying URL 'https://cran.rstudio.com/src/contrib/modelr_0.1.8.tar.gz'
Content type 'application/x-gzip' length 121333 bytes (118 KB)
downloaded 118 KB

  • installing source package 'backports' ...
    ** 成功将'backports'程序包解包并MD5和检查(unpackaged 'backports' succesfully and MD5 checked)
    ** using staged installation
    ** libs

*** arch - i386
make: C:/PROGRA1/R/R-401.0/etc/i386/Makeconf: No such file or directory
make: *** No rule to make target 'C:/PROGRA1/R/R-401.0/etc/i386/Makeconf'. Stop.
ERROR: compilation failed for package 'backports'

  • removing 'C:/Users/ColdSymmetry/Documents/R/win-library/4.0/backports'
  • restoring previous 'C:/Users/ColdSymmetry/Documents/R/win-library/4.0/backports'
    Error: Failed to install 'unknown package' from GitHub:
    (converted from warning) installation of package ‘backports’ had non-zero exit status

need a wiki for generating basic input data

It would be nice with some simple examples on how to generate the most basic input data: assembly + coverage

Estimate the amount of data you need

(100%/target_abundance%) * average_genomesize * coverage_needed
e..g
If the abundance is 1 % and you need 50x coverage to produce a good assembly and the genome size is 5 Mbp you will need
(100%/1%) * 5 * 10^6bp * 50 =5 * 10^9bp=5 Gbp

Trim your reads for poor quality and adapters using cutadapt

QUALITY=20; # Remove reads with a phred score below
THREADS=60; # Number of threads to use

Adapter sequences

NEX_ADP1=CTGTCTCTTATACACATCT # Illumina Nextera adaptor sequences
NEX_ADP2=CTGTCTCTTATACACATCT # Illumina Nextera adaptor sequences
TRU_ADP1=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA # Illumina TruSeq and NEB Nebnext adaptor sequences
TRU_ADP2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT # Illumina TruSeq and NEB Nebnext adaptor sequences
cutadapt -a $NEX_ADP1 -a $TRU_ADP1 -A $NEX_ADP2 -A $TRU_ADP2 -o data/SEQID.R1.trimmed.fastq --paired-output data/SEQID.R2.trimmed.fastq data/SEQID.R1.fastq data/SEQID.R2.fastq

Assemble genome using your favourite assembler e.g. spades, megahit.

Map illumina reads to assembly using minimap2 (https://github.com/lh3/minimap2)"

minimap2 -ax sr -t $THREADS assembly.fasta data/SEQID.R1.trimmed.fastq data/SEQID.R2.trimmed.fastq | samtools view --threads $THREADS -Sb - | samtools sort --threads $THREADS - -o temp/SEQID.aln.sorted.bam

echo "Generate coverage data" | tee -a log.txt;

(MA solution from multi metagenome https://github.com/MadsAlbertsen/multi-metagenome/blob/master/misc.scripts/calc.coverage.in.bam.depth.pl)

samtools depth temp/SEQID.aln.sorted.bam > temp/SEQID.depth.txt
FILEend=cov.csv;
perl calc.coverage.in.bam.depth.pl -i temp/SEQID.depth.txt -o results/SEQID_cov.csv

add function to plot GC skew

As long reads start to produce more circular genomes it might be useful to plot the GC skew (https://en.wikipedia.org/wiki/GC_skew) for a given contig. Example code and plot below:

`library(ggplot2)
library(dplyr)
library(seqinr)

gcskew <- function(x) {
if (!is.character(x) || length(x) > 1)
stop("single string expected")
tmp <- tolower(s2c(x))
nC <- sum(tmp == "c")
nG <- sum(tmp == "g")
if (nC + nG == 0) return(NA)
return(100 * (nG - nC)/(nC + nG))
}
plotgcskew<- function(fasta="EcoliK12MG1655_reference.fasta",step=10000,wsize=10000){
myseq <- read.fasta(file = fasta,as.string = TRUE)
cols <- c("LINE1"="blue","LINE2"="black")
starts <- seq(from = 1, to = nchar(myseq), by = step)
starts <- starts[-length(starts)]
n <- length(starts)
result <- numeric(n)
for (i in seq_len(n)) {
result[i] <- gcskew(substr(myseq, starts[i], starts[i] + wsize - 1))
}

Plot the result:

position <- starts/1000 # in Kbp
GCskew <- result
gcskew_data<-data.frame(position,GCskew) %>% mutate(cumGC=cumsum(GCskew),
normGCskew=(GCskew-max(GCskew))/(max(GCskew)-min(GCskew)),
normcumGC=(cumGC-max(cumGC))/(max(cumGC)-min(cumGC)))
ggplot(data = gcskew_data,aes(x = position,y = normGCskew))+geom_line(aes(col="gc skew"))+geom_line(aes(col="Cumulative GC skew",y = normcumGC))+ggtitle("GC skew plot")+xlab("Position (kbp)")
}`

gcskew

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.