tobiasgf / lulu
r package for post-clustering curation of amplicon next generation sequencing data (metabarcoding)
License: GNU Lesser General Public License v3.0
Hey,
In theory, setting a minimum ratio of 1 should result in more OTUs being flagged as errors than setting it to 0.01, since with 0.01 lulu will only flag as errors OTUs that are 100 times less abundant (1/100) than their potential parent in all samples (ratio_type at "min"), correct? I ran lulu on a small dataset (attached) and observe the opposite: with the ratio at 1, lulu curation results in ~20 more OTUs, with all other parameters left the same (minimum identity at 84% and co-occurrence at 0.90). Could you explain what is happening, please?
lulu-test.zip
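For reference, the two runs being compared would look something like this (a sketch; `otutab` and `matchlist` are placeholders for the OTU table and match list from the attached dataset):

```r
library(lulu)

# Same settings except minimum_ratio; otutab and matchlist are placeholders
# for the OTU table and the vsearch match list from the attached data.
res_strict <- lulu(otutab, matchlist, minimum_ratio_type = "min",
                   minimum_ratio = 1, minimum_match = 84,
                   minimum_relative_cooccurence = 0.90)
res_loose  <- lulu(otutab, matchlist, minimum_ratio_type = "min",
                   minimum_ratio = 0.01, minimum_match = 84,
                   minimum_relative_cooccurence = 0.90)

# Number of OTUs surviving curation in each run
nrow(res_strict$curated_table)
nrow(res_loose$curated_table)
```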
Thank you for the nice algorithm.
I have a question regarding the minimum_match
threshold. I have supplied LULU with a matchlist generated by vsearch
at 84% sequence similarity, and then run LULU at multiple thresholds. For example,
lulu(
otutable = asv_tab,
matchlist = asv_matches,
minimum_match = 93
)
I am finding that at some thresholds, child ASVs below the minimum match are merged (e.g., at 93% minimum match a 91.5% match is merged; at 95% minimum match, several children between 94% and 95% are merged).
Is there a simple explanation for this? I thought maybe rounding, but the 91.5% match at 93% minimum would seem to indicate that's not the issue. So far this has affected only very low frequency ASVs (in terms of sample count) so it's not a huge issue, but curious to know if this is intentional.
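To illustrate, this is roughly how I checked which merges fell below the threshold (a sketch; the match-list column names below are my assumption for a vsearch query/target/identity output, and I'm assuming lulu's `otu_map` element with its `curation` and `parent_id` columns):

```r
library(lulu)

res <- lulu(otutable = asv_tab, matchlist = asv_matches, minimum_match = 93)

# otu_map records, per ASV, whether it was merged and into which parent
merged <- subset(res$otu_map, curation == "merged")
merged$child <- rownames(merged)

# Assumed column order of the vsearch match list: child, parent, % identity
colnames(asv_matches) <- c("child", "parent", "identity")

# Merges whose pairwise identity is below the requested minimum_match
below <- merge(merged, asv_matches,
               by.x = c("child", "parent_id"), by.y = c("child", "parent"))
subset(below, identity < 93)
```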
Thanks!
Best,
Eric
Are there plans to release a version of LULU on CRAN or Bioconductor to make installation easier? If not, it would at least help to tag a version release so it can be downloaded from a stable link. Thanks!
Hi Tobias,
I can't seem to find the script buildOTUtable_simple.sh
in the provided CLI scripts. Am I missing something?
Best,
Bryan
Hi,
I followed the very excellent(!) pipeline and scripts you provided as part of LULU to go from DADA2 ASV output to inputs for vsearch clustering using 'extrSamDADA2', and ran the Alfa_DADA2_vsearch script. All works fine until the uc2otutab.py script is called, which has biopython fasta as a dependency. To install biopython in my virtual environment I had to use Python 3.6, which then produces print errors in scripts from the other Python dependencies (e.g. click, uc), which must have been written for Python 2.x. After altering the scripts to print in Python 3 style (adding parentheses; not sure if that was the best route?), Alfa_DADA2_vsearch.sh completes and the log shows no errors, producing .uc and .centroid files, but the .otu tables for each level of clustering are empty. It's not clear whether my .uc (or even fasta) input files are off, or whether the uc2otutab.py script is failing due to a Python version issue.
I would be happy to simply run vsearch directly on my files at some defined level of clustering and get an OTU table as output, i.e.,
vsearch --cluster_size input.fasta --id 0.97 --relabel OTU --biomout otutable.biom
as you suggested here: torognes/vsearch#166
but I'm not sure what input file to use? I have not used vsearch much before. I understand I should not use a dereplicated fasta; does the extrSamDADA2 script produce that file? Can I run usearch_global on the centroid and uc files with the --biomout argument?
Basically, the uc2otutab.py script is not working on my input files, so I am looking for an alternative way to cluster and output an OTU table with vsearch. I'll attach my dummy sample fasta files that were input to the Alfa_DADA2_vsearch.sh script (two identical samples from a reduced dataset) to see if they are perhaps part of the issue.
Fastas.zip
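In case it helps, recent vsearch versions can also write a plain-text OTU table directly via the --otutabout option (worth checking your version's manual), which avoids uc2otutab.py entirely; the result reads straight into R (file names here are placeholders):

```r
# After e.g.:
#   vsearch --cluster_size input.fasta --id 0.97 --relabel OTU_ \
#           --otutabout otutable.txt
# the tab-separated table can be read directly; first column = OTU IDs.
otutab <- read.delim("otutable.txt", row.names = 1, check.names = FALSE)
```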
Thanks for any guidance you can provide!
Louis
As per my question, I don't understand how to obtain OTUs at a specific similarity threshold. The tutorial you provide on the main page of this GitHub repository states that it is:
A step-by-step walk-through with the 97% clustered (VSEARCH) data from the LULU paper.
However, I see that in the walk-through you are using BLASTn for the matchlist (or I am missing the vsearch part) and then lulu() with some parameters. What is not clear to me is how you set the parameters to get 97% similarity between the final OTUs. So, are there any references that might help me understand how to set the parameters in order to get, say, 95%, 97%, or, for example, 93.25% similarity between the final clustered OTUs?
Hi There,
Is it possible to use LULU simply as a tool to cluster sequences at a certain percent ID? In other words, could I turn off (or set to a certain value) all of the other settings besides the cutoff? e.g. with the basic LULU code:
curated_result <- lulu(otutab, matchlist, minimum_ratio_type = "min", minimum_ratio = 1, minimum_match = 84, minimum_relative_cooccurence = 0.95)
I would like to ignore or effectively turn off the minimum_ratio_type option and minimum_relative_cooccurence so that only minimum_match is in effect (e.g., set to 97). I understand that this may not be the intended use of LULU, but it seems to me that with the right settings one could effectively minimize what the other options are doing? I'm just not sure what to set them to. Leaving them at their defaults, some of the curation still occurs.
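For what it's worth, one way to attempt this (a sketch, not a recommendation; whether it fully neutralizes the other checks depends on lulu's internals) is to set the two thresholds to values that should always pass:

```r
# minimum_ratio = 0: any co-occurring parent passes the abundance-ratio test
# minimum_relative_cooccurence = 0: effectively no co-occurrence requirement
curated_result <- lulu(otutab, matchlist,
                       minimum_match = 97,
                       minimum_ratio_type = "min",
                       minimum_ratio = 0,
                       minimum_relative_cooccurence = 0)
```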
Thanks for any help.
LP
Dear @tobiasgf
I think there is an undesired behavior with minimum_ratio when minimum_ratio_type == "min", and to some degree when minimum_ratio_type == "avg". It happens in the code below:
if (relative_cooccurence >= minimum_relative_cooccurence) {
cat(paste0(" which is sufficient!"), file = log_con)
if (minimum_ratio_type == "avg") {
relative_abundance <-
mean(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("\n", "------mean avg abundance: ",
relative_abundance), file = log_con)
} else {
relative_abundance <-
min(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("\n", "------min avg abundance: ",
relative_abundance), file = log_con)
}
When minimum_relative_cooccurence is smaller than 1.0 (default is 0.95), the potential parent OTU can be absent from some samples where the daughter OTU is present. A missing parent means that the ratio for that particular sample will be zero. This is where the issue is: the way the minimum ratio is computed, it will identify the samples where the parent is missing as the ones with the smallest ratio (zero). This falls below minimum_ratio, and the parent OTU is rejected, despite passing minimum_relative_cooccurence.
To solve that, the minimum ratio should be searched among the non-zero ratio values, not all ratio values.
When minimum_ratio_type == "avg", I think the average should be calculated on non-zero ratio values too.
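In code, the suggested change for the "min" branch could look like this (a sketch against the variable names used in the excerpt above):

```r
# Parent/daughter abundance ratios in samples where the daughter occurs
ratios <- otutable[line2, ][daughter_samples > 0] /
  daughter_samples[daughter_samples > 0]

# Take the minimum over non-zero ratios only, so samples where the
# potential parent is absent no longer force the ratio to zero
relative_abundance <- min(ratios[ratios > 0])
```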
Hello, I have the following warning. It is not FATAL yet, but I guess it could become so in the future!
**Warning message:
funs() was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas, e.g. tibble::lst(mean, median).
ℹ The deprecated feature was likely used in the lulu package.
Please report the issue to the authors.**
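For context, the standard migration away from funs() (a change that would have to happen inside lulu's own code) is to pass a list of functions or lambdas instead; a minimal sketch, where `df` stands for any data frame:

```r
library(dplyr)

# Deprecated since dplyr 0.8.0:
#   summarise_all(df, funs(mean))

# Current equivalent: a named list of functions or lambdas
summarise_all(df, list(mean = ~ mean(.x)))
```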
Cheers!
Adrià
I have a very large matrix (1.5 million OTUs × 12,000 samples) that I would like to run LULU on. Unfortunately, as output of dada2, the OTUs are columns. This is of course easy to fix using dplyr commands; however, they are intractable on a matrix this large.
Is it possible to run LULU with OTUs as columns? Thanks, Peter
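For what it's worth, if the table is a plain numeric matrix (as dada2's seqtab usually is) and fits in memory, base R's t() transposes it without any dplyr machinery; a sketch, where `seqtab` stands for the dada2 output:

```r
# dada2 returns samples as rows and ASVs as columns;
# lulu expects OTUs/ASVs as rows, so transpose with base t()
otutab <- as.data.frame(t(seqtab))
```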
Hi,
this is just a very tiny request.
The installation instructions are now:
> library(devtools)
> install_github("tobiasgf/lulu")
Could you change that to:
> library(devtools)
> install_github("tobiasgf/lulu")
# load the lulu package
> library(lulu)
I added the latter step, since not everybody realizes that they still have to load the package before trying to use it.
It is simple but for non-bioinformaticians these steps are really needed.
Hi Tobias et al.
Running Lulu in the Mjolnir wrapper, I get the following error:
Error in if (relative_cooccurence >= minimum_relative_cooccurence) { :
missing value where TRUE/FALSE needed
A closed thread (#7) mentioned that this is due to an OTU with a zero total count in the dataset, but examination of the OTU table does not indicate any such instances.
I wonder if you might have any advice?
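Two checks that might narrow it down (a sketch; `otutab` and `matchlist` stand for the inputs passed to lulu):

```r
# OTUs whose total count is zero across all samples
rownames(otutab)[rowSums(otutab) == 0]

# IDs present in the match list but missing from the OTU table; these can
# also lead to NA comparisons inside lulu
setdiff(unique(c(matchlist[[1]], matchlist[[2]])), rownames(otutab))
```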
Many thanks
Martin
Hi Tobias,
I ran LULU as described in E_LULU_processing.RMD and got back empty _luluprocessed tables. I could see that the proc_min list (produced at line 87) was not empty, so I thought there might just be an issue with extracting the curated table. Indeed, I think I found what the issue is.
In line 88 of E_LULU_processing.RMD it reads:
curated_table <- proc_min$curated_OTU_table ## extracting the curated table
However, line 179 of the Functions.R script for LULU reads:
result <- list(curated_table = curation_table,
I changed line 88 of E_LULU_processing.RMD to:
curated_table <- proc_min$curated_table ## extracting the curated table
This produced an OTU table that has a smaller file size and contains fewer reads than the original OTU table, so I assume that this solved the problem. Does that seem correct to you?
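A quick way to confirm which elements the result object actually provides:

```r
# Inspect the element names of the list returned by lulu()
names(proc_min)

# then extract by the name listed there:
curated_table <- proc_min$curated_table
```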
Thanks,
Lauren
Dear lulu team,
Thanks so much for your great tool to improve reliability in metabarcoding datasets!
I came across one issue though: although I can install and run the package locally without a problem, I cannot seem to get it to install on my server environment. It keeps giving me the following error message using the devtools command:
"Error: Failed to install 'unknown package' from GitHub:
Line starting 'E ...' is malformed!"
I also tried using the githubinstall package and got this message:
"In fread(download_url, sep = "\t", header = FALSE, stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 4848: <<Puriney honfleuR "Evening, honfleuR" by Seurat>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning."
I tried the suggested quote="" but to no avail. Does this ring any bell with you, or is it just a problem with dependencies on the server environment?
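If GitHub access from the server is the culprit, installing from a manually downloaded copy sidesteps devtools' GitHub machinery entirely (a sketch; the archive name is a placeholder for whatever the downloaded file is called):

```r
# Download the repository archive in a browser, copy it to the server, then:
devtools::install_local("lulu-master.zip")

# or, after unpacking, install straight from the source directory:
install.packages("lulu-master", repos = NULL, type = "source")
```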
Hope you can help me! Thanks in advance.
Best regards
Marcel Polling
Dear Tobias,
Thanks for this package, it works nicely!
I would like to integrate it into my JAMP pipeline, but the progress report gives a lot of feedback. I would like to suppress that, but can't, since print is used to report the progress. Could you use message instead? Then it's easier to hide. =)
Also, an option to just give one message for each single % step (instead of several) would be super helpful.
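Until then, a workaround on the calling side is to capture and discard the console output (a sketch; this catches output written with print/cat, while messages and warnings pass through separately):

```r
# Capture and discard everything lulu prints to stdout
invisible(capture.output(
  res <- lulu(otutab, matchlist)
))
```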
Thank you! Keep up the good work!
Best
Vasco