madsalbertsen / mmgenome

Please use mmgenome2 instead. Tools for extracting individual genomes from metagenomes

Home Page: https://kasperskytte.github.io/mmgenome2/

Perl 2.66% R 2.45% Shell 0.13% HTML 94.27% JavaScript 0.01% CSS 0.49%

mmgenome's Issues

problems installing mmgenome in R

devtools installed OK, but then I got this:

library(devtools)
Loading required package: usethis
devtools::install_github("MadsAlbertsen/mmgenome/mmgenome")
Downloading GitHub repo MadsAlbertsen/mmgenome@HEAD
Error in utils::download.file(url, path, method = method, quiet = quiet, :
download from 'https://api.github.com/repos/MadsAlbertsen/mmgenome/tarball/HEAD' failed

Have you seen this and can you offer any advice?

This worked cleanly at the command line:

git clone https://github.com/MadsAlbertsen/mmgenome.git

My system:

R version 4.1.3 (2022-03-10) -- "One Push-Up"

RStudio 2022.02.1+461 "Prairie Trillium" Release (8aaa5d470dd82d615130dbf663ace5c7992d48e3, 2022-03-17) for macOS
Mozilla/5.0 (Macintosh; Intel Mac OS X 11_6_4) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.10 Chrome/69.0.3497.128 Safari/537.36

Model Name: MacBook Pro
Model Identifier: MacBookPro15,2
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.7 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
System Firmware Version: 1715.81.2.0.0 (iBridge: 19.16.10744.0.0,0)
Serial Number (system): C02XT0V5JHD3
Hardware UUID: 59E685F5-B82B-5221-9B17-3417B3C06DED
Provisioning UDID: 59E685F5-B82B-5221-9B17-3417B3C06DED
Activation Lock Status: Disabled

System Version: macOS 11.6.4 (20G417)
Kernel Version: Darwin 20.6.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Time since boot: 7:24

Not a problem, but a couple of feature requests

Hi, Mads,

Great tool; many thanks! Could you add a function to separate out scaffolds by pps taxon (or marker gene taxon), such as:

function(df, name, level, omit = FALSE) {
  # Capitalise the taxon name and lower-case the level (e.g. "phylum")
  name <- paste0(toupper(substring(name, 1, 1)), substring(name, 2))
  level <- tolower(level)
  if (!is.logical(omit) || length(omit) != 1 || is.na(omit)) {
    stop("The 'omit' value should be either TRUE or FALSE")
  }
  # Look up the column (e.g. pps_phylum) by name instead of using eval(parse())
  taxon <- df$scaffolds[[paste0("pps_", level)]]
  keep <- if (omit) taxon != name else taxon == name
  new_df <- list(scaffolds = data.frame(), essential = data.frame())
  # Scaffolds without a classification (NA) are dropped in both modes
  new_df$scaffolds <- df$scaffolds[!is.na(keep) & keep, ]
  new_df$essential <- df$essential[df$essential$scaffold %in% new_df$scaffolds$scaffold, ]
  return(new_df)
}

Also, could you add a function to slice the list of dataframes by coverage?

I find myself doing both things quite frequently...
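
For the coverage request, a minimal sketch of what such a slice could look like is below. It assumes the data is a list with a scaffolds data frame and an essential data frame that share a scaffold column, as in the function above. The function name mmslice_coverage, its arguments, and the toy column names (AS1, hmm.id) are hypothetical and not part of mmgenome.

```r
# Hypothetical helper (not part of mmgenome): keep scaffolds whose coverage in
# one column lies within [min_cov, max_cov], then subset `essential` to match.
mmslice_coverage <- function(df, column, min_cov = 0, max_cov = Inf) {
  covg <- df$scaffolds[[column]]
  keep <- !is.na(covg) & covg >= min_cov & covg <= max_cov
  scaffolds <- df$scaffolds[keep, , drop = FALSE]
  essential <- df$essential[df$essential$scaffold %in% scaffolds$scaffold, , drop = FALSE]
  list(scaffolds = scaffolds, essential = essential)
}

# Toy data standing in for a loaded dataset
toy <- list(
  scaffolds = data.frame(scaffold = c("s1", "s2", "s3"),
                         AS1 = c(5, 50, 500),
                         stringsAsFactors = FALSE),
  essential = data.frame(scaffold = c("s1", "s2", "s2", "s3"),
                         hmm.id = c("PF00162", "PF00162", "PF00318", "PF00162"),
                         stringsAsFactors = FALSE)
)

# Keep only scaffolds with 10 <= AS1 coverage <= 100
sliced <- mmslice_coverage(toy, "AS1", min_cov = 10, max_cov = 100)
print(sliced$scaffolds)
```

Returning the same two-element list keeps the result usable wherever the original structure is expected.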

mmplot_locator: Locator coordinate system out of sync when using custom axis limits

mmplot_locator is used with a ggplot object, often one generated by mmplot. If custom axis limits are added to the ggplot object, e.g. with scale_x_log10(limits=c(x1, x2)), the coordinate system in mmplot_locator will be out of sync with the coordinate system in the ggplot object. As a result, the wrong area of the plot is selected.

Workaround: Instead of zooming by using axis limits, zoom by using mmplot_locator and mmextract to subset your data.

Importing FASTA data using Biostrings into mmgenome

I wanted to give the package a try, but I'm having a hard time getting past loading the example data. I'm getting this error in Biostrings. It might not have to do with the mmgenome package, but I can't find any solutions via Google.

> assembly <- readDNAStringSet("data/assembly.fa", format = "fasta")
Error in .Call2("new_input_ExternalFilePtr", fp, PACKAGE = "Biostrings") : 
  cannot open file 'data/assembly.fa'

Here's my R information:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets 
 [9] methods   base     

other attached packages:
 [1] mmgenome_0.6.1       dplyr_0.4.1          reshape2_1.4.1       igraph_0.7.1        
 [5] Biostrings_2.36.1    XVector_0.8.0        IRanges_2.2.1        S4Vectors_0.6.0     
 [9] BiocGenerics_0.14.0  vegan_2.3-0          lattice_0.20-31      permute_0.8-4       
[13] knitr_1.10.5         ggplot2_1.0.1        gridExtra_0.9.1      sp_1.1-0            
[17] BiocInstaller_1.18.2

How best do you recommend importing FASTA files?

Thanks so much for your time and developing a needed tool.
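
The "cannot open file" message usually means the relative path does not resolve from R's current working directory, rather than a problem with the FASTA content. That is an assumption about this particular report, not a confirmed diagnosis. A small sketch of a check that can be run before calling readDNAStringSet; the helper name check_input_file is hypothetical:

```r
# Hypothetical helper: fail early with a readable message if the path does not
# resolve from the current working directory.
check_input_file <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: '", path, "' (working directory: ", getwd(), ")")
  }
  normalizePath(path)
}

# Demonstration with a temporary FASTA file
tmp <- tempfile(fileext = ".fa")
writeLines(c(">scaffold1", "ACGTACGT"), tmp)
resolved <- check_input_file(tmp)
print(resolved)

# With Biostrings installed, the checked path can then be passed on:
# assembly <- Biostrings::readDNAStringSet(check_input_file("data/assembly.fa"), format = "fasta")
```

If the check fails, either change the working directory with setwd() or pass the full path to the file.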

mmplot_locator error

Hi, I just started to try mmgenome, and after a successful installation I am now stuck at the basic step of selecting clusters. When I try to reproduce the 'genome_extraction' example and type mmplot_locator(p), I get the following error message:

> sel <- mmplot_locator(p)
Error in grid.Call.graphics(L_downviewport, name$name, strict) : 
  Viewport 'panel.3-4-3-4' was not found
In addition: Warning messages:
1: Transformation introduced infinite values in continuous x-axis 
...

Then I looked in the source of the function and saw that the name of the viewport was hardcoded as panel.3-4-3-4, but for some reason on my system the viewport seems to default to layout....
After changing the viewport name (saving it in a new function, my_locator), I was able to set some selection points, but then received the following message when trying to finish the selection with the right mouse button:

> sel <- my_locator(p)
Error in if (pi$panel$x_scales[[1]]$trans$name == "log-10") d[, 1] <- 10^(d[,  : 
  argument is of length zero
In addition: Warning messages:
1: Transformation introduced infinite values in continuous x-axis 
...

Any idea on how to fix the problem would be highly appreciated!

I am running openSUSE Leap 42.2 and my R version is 3.3.2.
Thanks!

edit: my sessionInfo:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 42.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] mmgenome_0.6.3      dplyr_0.5.0         reshape2_1.4.2     
 [4] igraph_1.0.1        Biostrings_2.42.0   XVector_0.14.0     
 [7] IRanges_2.8.1       S4Vectors_0.12.0    BiocGenerics_0.20.0
[10] vegan_2.4-1         lattice_0.20-34     permute_0.9-4      
[13] knitr_1.15          ggplot2_2.2.0       gridExtra_2.2.1    
[16] sp_1.2-3           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8      magrittr_1.5     cluster_2.0.5    zlibbioc_1.20.0 
 [5] MASS_7.3-45      munsell_0.4.3    colorspace_1.3-0 R6_2.2.0        
 [9] stringr_1.1.0    plyr_1.8.4       tools_3.3.2      gtable_0.2.0    
[13] nlme_3.1-128     mgcv_1.8-15      DBI_0.5-1        digest_0.6.10   
[17] lazyeval_0.2.0   assertthat_0.1   tibble_1.2       Matrix_1.2-7.1  
[21] labeling_0.3     stringi_1.1.2    scales_0.4.1

How to generate "tax.txt"

Hi Everyone,

I am a beginner on metagenome binning.
Following the instructions, I am preparing my own R data sets.

In the file "rRNA.sh", "16S.fa" is generated, but how to generate "tax.txt" is not mentioned:

"makeblastdb -in assembly.fa -dbtype nucl > temp.blast2
blastn -query /space/users/malber06/Desktop/Databases/gg_90id/gg_90.fasta -db assembly.fa -num_threads 60 -max_target_seqs 5 -outfmt 6 -evalue 1e-10 -out temp.blast.txt

perl /space/users/malber06/mmgenome/scripts/extract.long.hits.from.blast.pl -b temp.blast.txt -d assembly.fa -m 500 -o 16S.fa"

Could anyone clue me in?
Regards,

Chen

Alphahull v.1.0.0 broken

The new version of Alphahull (1.0.0) is broken. We have contacted the author. Current fix is to require Alphahull v. 0.2-1 instead: http://cran.r-project.org/src/contrib/Archive/alphahull/.
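
One way to pin the older version is to install the archived source tarball directly. The sketch below only builds the URL, assuming the usual CRAN archive layout of <package>/<package>_<version>.tar.gz under the Archive folder linked above. The helper name cran_archive_url is hypothetical, and the install call is left commented out because it needs network access and a build toolchain.

```r
# Build the URL of an archived source tarball, assuming the usual CRAN layout
# <base>/<package>/<package>_<version>.tar.gz
cran_archive_url <- function(pkg, version,
                             base = "http://cran.r-project.org/src/contrib/Archive") {
  sprintf("%s/%s/%s_%s.tar.gz", base, pkg, pkg, version)
}

url <- cran_archive_url("alphahull", "0.2-1")
print(url)

# Not run here:
# install.packages(url, repos = NULL, type = "source")
```

Installing from a URL with repos = NULL does not resolve dependencies, so any packages alphahull depends on would need to be installed first.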

Labelling of contigs

A useful feature could be to label contigs based on columns other than the sequence ID, e.g. the 16S rRNA gene classification of a contig.

Attempt to set 'colnames' on an object with less than two dimensions

The error occurs when I call mmload with the "other" parameter set to a tab-delimited file containing scaffolds and GC contents, as generated by the calc.gc.pl script in the multi-metagenome R.data.generation here: http://madsalbertsen.github.io/multi-metagenome/docs/step5.html. My GC file looks like this:

contig gc
scaffold00001 61.59
scaffold00002 59.92
scaffold00003 63.19
scaffold00004 59.11
scaffold00005 63.08

In R, this file is loaded using read.table. Its str() output is:
'data.frame': 1421152 obs. of 2 variables:
$ contig: Factor w/ 1421152 levels "scaffold00001",..: 1 2 3 4 5 6 7 8 9 10 ...
$ gc : num 61.6 59.9 63.2 59.1 63.1 ...

The full error is:
Error in `colnames<-`(`*tmp*`, value = "scaffold") :
attempt to set 'colnames' on an object with less than two dimensions

I haven't had the opportunity to look through the source code to find where this error is coming from. If I try mmload without adding this file, it works.
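
For what it's worth, this error message can be reproduced in plain R whenever a single column is selected from a data frame without drop = FALSE, because the result is a vector rather than a one-column data frame. Whether that is what happens inside mmload is an assumption; the sketch below only illustrates the general mechanism with a toy table named gc_tab.

```r
gc_tab <- data.frame(contig = c("scaffold00001", "scaffold00002"),
                     gc = c(61.59, 59.92))

# Selecting one column with [ , j] drops the data frame to a plain vector,
# so assigning column names fails
one_col <- gc_tab[, 1]
res <- tryCatch({
  colnames(one_col) <- "scaffold"
  "ok"
}, error = function(e) conditionMessage(e))
print(res)

# With drop = FALSE the result keeps two dimensions and renaming works
kept <- gc_tab[, 1, drop = FALSE]
colnames(kept) <- "scaffold"
print(kept)
```

A table with one name column and only one data column is exactly the case where such a drop can occur once the name column is removed.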

mmplot to Highlight contigs that have already been binned

It should be possible to supply mmplot with a folder path e.g. "bins/" with the .fasta files and then highlight all contigs that are already included in the previous bins.
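
Until something like this exists, the binned contig names can be collected by hand. The sketch below reads the header lines of every .fasta file in a folder using base R only. The helper name binned_contigs is hypothetical, and it assumes the sequence name is the part of the header before the first whitespace.

```r
# Hypothetical helper: collect sequence names from every .fasta file in a folder
binned_contigs <- function(bin_dir) {
  files <- list.files(bin_dir, pattern = "\\.fasta$", full.names = TRUE)
  headers <- unlist(lapply(files, function(f) {
    grep("^>", readLines(f), value = TRUE)
  }))
  # Drop the leading ">" and anything after the first whitespace
  ids <- sub("[[:space:]].*$", "", sub("^>", "", headers))
  unique(ids)
}

# Demonstration with two small temporary bins
bin_dir <- file.path(tempdir(), "bins_demo")
dir.create(bin_dir, showWarnings = FALSE)
writeLines(c(">scaffold1 len=8", "ACGTACGT", ">scaffold2", "GGCC"),
           file.path(bin_dir, "bin1.fasta"))
writeLines(c(">scaffold7", "TTAA"), file.path(bin_dir, "bin2.fasta"))

binned <- binned_contigs(bin_dir)
print(binned)

# A logical column could then be added for colouring, e.g.
# d$scaffolds$binned <- d$scaffolds$scaffold %in% binned
```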

Info on citation should be on the front page

Info on how to cite the package should be on the front page
"mmgenome: a toolbox for reproducible genome extraction from metagenomes"
Soeren M Karst, Rasmus H Kirkegaard, Mads Albertsen
bioRxiv 059121; doi: https://doi.org/10.1101/059121

Extracting 16S rRNA sequences

Hi,

I'm fairly new to this and was wondering where I could access the gg_90.fasta file used to extract 16S rRNA sequences in the rRNA.sh script. Is this a reference to the Greengenes database (i.e. latest version is gg_13_5), or is it something else entirely?

Thanks so much!

plot coverage variation of a bin across all samples

Make a function to plot coverage variation of a bin across all samples

Something like this

mmplot_coverage <- function(data, ncov = 2, log.y = FALSE, plotline = FALSE) {
  dm <- melt(data$scaffolds, measure.vars = names(data$scaffolds)[c(4:(ncov + 3))])
  p <- ggplot(dm, aes(x = variable, y = value, col = tax, size = length)) +
    geom_point(alpha = 0.1) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
    ylab("Coverage")
  if (log.y) p <- p + scale_y_log10()
  if (plotline) p <- p + geom_line(aes(group = scaffold), size = 1)
  return(p)
}

Using contigs instead of scaffolds for the genome extraction workflow

Hi,

I used MEGAHIT to assemble a sediment metagenome. However, I ran into issues with finding a suitable scaffolder for converting the contigs to scaffolds.

Can I go ahead with the mmgenome workflow only with contigs? Is there a recommended scaffolder for metagenomic contigs?

Thanks

Detailed description of import file for network.pl and extract.fastq.for.reassembly.pl

Hello, thanks for providing these helpful scripts!
Could you explain the input files for these scripts in detail? For example, how do we get the .sam file they use (by mapping the reads from the sample to the assembly using Bowtie2 or other software; is that the right process)? What about the other parameters (-inref; -infastq...)? Could you give me an example?
Thank you!

Best regards

New issue about running MEGAN

Hello, I ran into an issue when running MEGAN4.x with your script (I had successfully finished the BLAST step):

MEGAN fatal error:
java.lang.NumberFormatException: For input string: "32644;37965;134367;2323;28384;61964;48510;47936;186616;12908;48479;156614;367897;"
java.lang.NumberFormatException: For input string: "32644;37965;134367;2323;28384;61964;48510;47936;186616;12908;48479;156614;367897;"

Is there any way to fix this issue?
Thank you for your help!

Difficulty installing

When installing mmgenome from scratch with:
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
devtools::install_github("madsalbertsen/mmgenome/mmgenome")
it stops when installing a required package Rtsne.multicore with the error:

"Error in FUN(X[[i]], ...) :
Invalid comparison operator in dependency: >= "

Simply using biocLite("madsalbertsen/mmgenome/mmgenome") instead causes no trouble!

Import example sam file for the network.pl script

Check inputfiles for correctness

During file loading it should be checked if the files are properly formatted.

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

workflows/rRNA.sh

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)

Metadata in data(eg) is incorrect for Candidatus genera

Hi Mads!

I just noticed, that the Genus and Species columns in data(eg) are incorrect for strain names with a Candidatus prefix.

Thanks!
Bela

PS: taxonomic data on all levels (family, etc) would also be convenient :)

Demo:

library(mmgenome)
data(eg)
subset(eg, Genus == 'Candidatus')[1:10, 1:10]

    Accession                            Phylum                  Class      Genus         Species
77  NC_005061                    Proteobacteria    Gammaproteobacteria Candidatus     Blochmannia
88  NC_005861  Chlamydiae/Verrucomicrobia group             Chlamydiae Candidatus  Protochlamydia
121 NC_007205                    Proteobacteria    Alphaproteobacteria Candidatus    Pelagibacter
186 NC_008009 Fibrobacteres/Acidobacteria group          Acidobacteria Candidatus      Koribacter
216 NC_008512                    Proteobacteria    Gammaproteobacteria Candidatus      Carsonella
222 NC_008536 Fibrobacteres/Acidobacteria group          Acidobacteria Candidatus      Solibacter
237 NC_008610                    Proteobacteria    Gammaproteobacteria Candidatus          Ruthia
291 NC_009465                    Proteobacteria    Gammaproteobacteria Candidatus Vesicomyosocius
374 NC_010424                        Firmicutes             Clostridia Candidatus    Desulforudis
376 NC_010482                      Korarchaeota Candidatus Korarchaeum Candidatus     Korarchaeum
                                                         Strain   Length   GC  Kingdom PF00162
77                            Candidatus Blochmannia floridanus 0.705557 27.4 Bacteria       1
88                  Candidatus Protochlamydia amoebophila UWE25 2.414460 34.7 Bacteria       1
121                     Candidatus Pelagibacter ubique HTCC1062 1.308760 29.7 Bacteria       1
186                   Candidatus Koribacter versatilis Ellin345 5.650370 58.4 Bacteria       1
216                             Candidatus Carsonella ruddii PV 0.159662 16.6 Bacteria       0
222                    Candidatus Solibacter usitatus Ellin6076 9.965640 61.9 Bacteria       1
237 Candidatus Ruthia magnifica str. Cm (Calyptogena magnifica) 1.160780   34 Bacteria       1
291                      Candidatus Vesicomyosocius okutanii HA 1.022150 31.6 Bacteria       1
374                  Candidatus Desulforudis audaxviator MP104C 2.349480 60.8 Bacteria       1
376                     Candidatus Korarchaeum cryptofilum OPF8 1.590760   49  Archaea       1
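
A possible way to derive the two columns from the Strain column is sketched below. It assumes the strain name has the form "[Candidatus] Genus species ..." and simply skips a leading "Candidatus". Whether the qualifier should instead be kept as part of the genus is a design choice for the maintainers. The helper name split_strain is hypothetical.

```r
# Hypothetical helper: split a strain name into genus and species, skipping a
# leading "Candidatus" qualifier
split_strain <- function(strain) {
  words <- strsplit(strain, " +")[[1]]
  if (length(words) > 0 && words[1] == "Candidatus") {
    words <- words[-1]
  }
  list(genus = words[1], species = words[2])
}

strains <- c("Candidatus Pelagibacter ubique HTCC1062",
             "Candidatus Blochmannia floridanus")
parsed <- lapply(strains, split_strain)
print(parsed[[1]])
```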

Simplify data structure

Change main data structure from a list of two data.frames to just one data.frame.

mmplot: duplicates = TRUE produces error when no duplicates are present

When the option duplicates = TRUE is set in mmplot and the dataset doesn't contain duplicate genes, an error occurs.

Error in if (dup[i, 1] != dup[j, 1] & dup[i, 2] == dup[j, 2]) { :
missing value where TRUE/FALSE needed

Workaround: Ignore it.

mmplot_locator(p) warning

All other things work fine, any idea to solve the problem below?

Thanks,
Penny

mmplot_locator(p)
[1] "~Tab = c(NaN, NaN, NaN, NaN, NaN)"
[1] "~gc = c(NaN, NaN, NaN, NaN, NaN)"
~Tab ~gc
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
Warning messages:
1: In max(xr) : no non-missing arguments to max; returning -Inf
2: In min(xr) : no non-missing arguments to min; returning Inf
3: In min(xr) : no non-missing arguments to min; returning Inf
4: In max(yr) : no non-missing arguments to max; returning -Inf
5: In min(yr) : no non-missing arguments to min; returning Inf
6: In min(yr) : no non-missing arguments to min; returning Inf

Scaffold names not recognized?

Hey Mads,

Just having a small problem which is probably simple but I can't get it to work!

For some reason I keep getting the error 'The coverage file contains scaffolds with names not found in the assembly'. I've done a couple of checks and I'm pretty sure this isn't true - must be something to do with how I'm calling them in.

I can reproduce this with just the first few scaffs, which I can attach if you like (is there an easy way to attach files here??). My commands are below:

# read in assembly file

assembly <- readDNAStringSet("data/10scaffsAS0.fa", format = "fasta")

# now read in coverage information

AS1 <- read.table("data/4scaffs.coverage.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]
AS2 <- read.table("data/4scaffs02.coverage.csv", header = T, sep = ",")[,c("Reference.sequence", "Average.coverage")]

# now to merge this data

d <- mmload(assembly = assembly,
            pca = T,
            coverage = c("AS1", "AS2"),
)
[1] "Loading scaffold length, coverage and gc."
Error in mmload(assembly = assembly, pca = T, coverage = c("AS1", "AS2"), :
The coverage file AS1 contains scaffolds with names not found in the assembly. Make sure that the names of the scaffolds in the assembly and coverage files are identical.
In addition: Warning message:
In mmload(assembly = assembly, pca = T, coverage = c("AS1", "AS2"), :
The coverage file AS1 contains less scaffolds than the assembly. Setting the coverage of the missing scaffolds to 0.
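
A quick way to see which names actually differ is to compare the two name sets directly. The sketch below uses toy vectors. With real data the inputs would be names(assembly) and as.character(AS1$Reference.sequence). The helper name compare_scaffold_names is hypothetical, and extra text after the first whitespace in FASTA headers is only one possible cause of a mismatch.

```r
# Hypothetical helper: report names present in one input but not the other
compare_scaffold_names <- function(assembly_names, coverage_names) {
  list(
    only_in_coverage = setdiff(coverage_names, assembly_names),
    only_in_assembly = setdiff(assembly_names, coverage_names)
  )
}

# Toy example: the FASTA headers carry a description, the coverage table does not
assembly_names <- c("scaffold1 length=1200", "scaffold2 length=800")
coverage_names <- c("scaffold1", "scaffold2")
diffs <- compare_scaffold_names(assembly_names, coverage_names)
print(diffs$only_in_coverage)

# Trimming the headers at the first whitespace makes the names comparable
trimmed <- sub("[[:space:]].*$", "", assembly_names)
diffs_trimmed <- compare_scaffold_names(trimmed, coverage_names)
print(diffs_trimmed$only_in_coverage)
```

Printing a few entries from each side of the comparison usually makes the difference (whitespace, prefixes, case) obvious.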