tobiasgf / lulu
r package for post-clustering curation of amplicon next generation sequencing data (metabarcoding)
License: GNU Lesser General Public License v3.0
Hey,
In theory, setting a minimum ratio of 1 should result in more OTUs being flagged as errors than setting it to 0.01, since with 0.01 lulu will only flag as errors OTUs that are 100 times less abundant (1/100) than their potential parent in all samples (ratio_type at "min"), correct? I ran lulu on a small dataset (attached) and observe the opposite: with the ratio at 1, lulu curation results in ~20 more OTUs, with all other parameters left the same (minimum identity at 84% and co-occurrence at 0.90). Could you explain what is happening, please?
lulu-test.zip
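For reference, the two runs being compared would look something like this (a sketch; `otutab` and `matchlist` are placeholders for the OTU table and match list from the attached dataset):

```r
library(lulu)

# Same settings except minimum_ratio; otutab and matchlist are placeholders
# for the OTU table and the vsearch match list from the attached data.
res_strict <- lulu(otutab, matchlist, minimum_ratio_type = "min",
                   minimum_ratio = 1, minimum_match = 84,
                   minimum_relative_cooccurence = 0.90)
res_loose  <- lulu(otutab, matchlist, minimum_ratio_type = "min",
                   minimum_ratio = 0.01, minimum_match = 84,
                   minimum_relative_cooccurence = 0.90)

# Number of OTUs surviving curation in each run
nrow(res_strict$curated_table)
nrow(res_loose$curated_table)
```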
Thank you for the nice algorithm.
I have a question regarding the minimum_match
threshold. I have supplied LULU with a matchlist generated by vsearch
at 84% sequence similarity, and then run LULU at multiple thresholds. For example,
lulu(
otutable = asv_tab,
matchlist = asv_matches,
minimum_match = 93
)
I am finding that at some thresholds, child ASVs below the minimum match are merged (e.g., at 93% minimum match a 91.5% match is merged; at 95% minimum match, several children between 94% and 95% are merged).
Is there a simple explanation for this? I thought maybe rounding, but the 91.5% match at 93% minimum would seem to indicate that's not the issue. So far this has affected only very low frequency ASVs (in terms of sample count) so it's not a huge issue, but curious to know if this is intentional.
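To illustrate, this is roughly how I checked which merges fell below the threshold (a sketch; the match-list column names below are my assumption for a vsearch query/target/identity output, and I'm assuming lulu's `otu_map` element with its `curation` and `parent_id` columns):

```r
library(lulu)

res <- lulu(otutable = asv_tab, matchlist = asv_matches, minimum_match = 93)

# otu_map records, per ASV, whether it was merged and into which parent
merged <- subset(res$otu_map, curation == "merged")
merged$child <- rownames(merged)

# Assumed column order of the vsearch match list: child, parent, % identity
colnames(asv_matches) <- c("child", "parent", "identity")

# Merges whose pairwise identity is below the requested minimum_match
below <- merge(merged, asv_matches,
               by.x = c("child", "parent_id"), by.y = c("child", "parent"))
subset(below, identity < 93)
```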
Thanks!
Best,
Eric
Are there plans to release a version of LULU on CRAN or Bioconductor to make installation easier? If not, it would at least help to tag a version release so it can be downloaded from a stable link. Thanks!
Hi Tobias,
I can't seem to find the script buildOTUtable_simple.sh
in the provided CLI scripts. Am I missing something?
Best,
Bryan
Hi,
I followed the very excellent(!) pipeline and scripts you provided as part of LULU to go from DADA2 ASV output to inputs for vsearch clustering using 'extrSamDADA2', and ran the Alfa_DADA2_vsearch script. All works fine until the uc2otutab.py script is called, which has biopython fasta as a dependency. To install biopython in my virtual environment I had to use Python 3.6, which then produces print errors in scripts from the other Python dependencies (e.g. click, uc), which must have been written for Python 2.x. After altering the scripts to print in Python 3 style (adding parentheses; not sure if that was the best route?), Alfa_DADA2_vsearch.sh completes and the log shows no errors, producing .uc and .centroid files, but the .otu tables for each level of clustering are empty. It's not clear whether my .uc (or even fasta) input files are off, or whether the uc2otutab.py script is failing due to a Python version issue.
I would be happy to simply run vsearch directly on my files at some defined level of clustering and get an OTU table as output, i.e.,
vsearch --cluster_size input.fasta --id 0.97 --relabel OTU --biomout otutable.biom
as you suggested here: torognes/vsearch#166
but I'm not sure what input file to use? I have not used vsearch much before. I understand I should not use a dereplicated fasta; does the extrSamDADA2 script produce that file? Can I run usearch_global on the centroid and uc files with the --biomout argument?
Basically, the uc2otutab.py script is not working on my input files, so I am looking for an alternative way to cluster and output an OTU table with vsearch. I'll attach my dummy sample fasta files that were input to the Alfa_DADA2_vsearch.sh script (two identical samples from a reduced dataset) to see if they are perhaps part of the issue.
Fastas.zip
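In case it helps, recent vsearch versions can also write a plain-text OTU table directly via the --otutabout option (worth checking your version's manual), which avoids uc2otutab.py entirely; the result reads straight into R (file names here are placeholders):

```r
# After e.g.:
#   vsearch --cluster_size input.fasta --id 0.97 --relabel OTU_ \
#           --otutabout otutable.txt
# the tab-separated table can be read directly; first column = OTU IDs.
otutab <- read.delim("otutable.txt", row.names = 1, check.names = FALSE)
```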
Thanks for any guidance you can provide!
Louis
As per my question, I don't understand how to obtain OTUs at a specific similarity threshold. The tutorial you provide on the main page of this GitHub repository states that it is:
A step-by-step walk-through with the 97% clustered (VSEARCH) data from the LULU paper.
However, I see that in the walk-through you are using BLASTn for the matchlist (or I am missing the vsearch part) and then lulu() with some parameters. What is not clear to me is how you set the parameters to get 97% similarity between the final OTUs. So, are there any references that might help me understand how to set the parameters in order to get, say, 95%, 97%, or, for example, 93.25% similarity between the final clustered OTUs?
Hi There,
Is it possible to use LULU simply as a tool to cluster sequences at a certain percent ID? In other words, could I turn off (or set to a certain value) all of the other settings besides the cutoff? e.g. with the basic LULU code:
curated_result <- lulu(otutab, matchlist, minimum_ratio_type = "min", minimum_ratio = 1, minimum_match = 84, minimum_relative_cooccurence = 0.95)
I would like to ignore or effectively turn off the minimum_ratio_type option and minimum_relative_cooccurence so that only minimum_match is in effect (e.g., set to 97). I understand that this may not be the intended use of LULU, but it seems to me that with the right settings one could effectively minimize what the other options are doing? I'm just not sure what to set them to. Leaving them at their defaults, some of the curation still occurs.
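For what it's worth, one way to attempt this (a sketch, not a recommendation; whether it fully neutralizes the other checks depends on lulu's internals) is to set the two thresholds to values that should always pass:

```r
# minimum_ratio = 0: any co-occurring parent passes the abundance-ratio test
# minimum_relative_cooccurence = 0: effectively no co-occurrence requirement
curated_result <- lulu(otutab, matchlist,
                       minimum_match = 97,
                       minimum_ratio_type = "min",
                       minimum_ratio = 0,
                       minimum_relative_cooccurence = 0)
```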
Thanks for any help.
LP
Dear @tobiasgf
I think there is an undesired behavior with minimum_ratio when minimum_ratio_type == "min", and to some degree when minimum_ratio_type == "avg". It happens in the code below:
if (relative_cooccurence >= minimum_relative_cooccurence) {
cat(paste0(" which is sufficient!"), file = log_con)
if (minimum_ratio_type == "avg") {
relative_abundance <-
mean(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("\n", "------mean avg abundance: ",
relative_abundance), file = log_con)
} else {
relative_abundance <-
min(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("\n", "------min avg abundance: ",
relative_abundance), file = log_con)
}
When minimum_relative_cooccurence is smaller than 1.0 (default is 0.95), the potential parent OTU can be absent from some samples where the daughter OTU is present. A missing parent means that the ratio for that particular sample will be zero. This is where the issue is: the way the minimum ratio is computed, it will identify the samples where the parent is missing as the ones with the smallest ratio (zero). This falls below minimum_ratio, and the parent OTU is rejected, despite passing minimum_relative_cooccurence.
To solve that, the minimum ratio should be searched among the non-zero ratio values, not all ratio values.
When minimum_ratio_type == "avg", I think the average should be calculated on non-zero ratio values too.
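In code, the suggested change for the "min" branch could look like this (a sketch against the variable names used in the excerpt above):

```r
# Parent/daughter abundance ratios in samples where the daughter occurs
ratios <- otutable[line2, ][daughter_samples > 0] /
  daughter_samples[daughter_samples > 0]

# Take the minimum over non-zero ratios only, so samples where the
# potential parent is absent no longer force the ratio to zero
relative_abundance <- min(ratios[ratios > 0])
```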
Hello, I have the following warning. It is not FATAL yet, but I guess it could become so in the future!
**Warning message:
funs() was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas, e.g. tibble::lst(mean, median).
ℹ The deprecated feature was likely used in the lulu package.
Please report the issue to the authors.**
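For context, the standard migration away from funs() (a change that would have to happen inside lulu's own code) is to pass a list of functions or lambdas instead; a minimal sketch, where `df` stands for any data frame:

```r
library(dplyr)

# Deprecated since dplyr 0.8.0:
#   summarise_all(df, funs(mean))

# Current equivalent: a named list of functions or lambdas
summarise_all(df, list(mean = ~ mean(.x)))
```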
Cheers!
Adrià
I have a very large matrix (1.5 million OTUs × 12,000 samples) that I would like to run LULU on. Unfortunately, as output of dada2, the OTUs are columns. This is of course easy to fix using dplyr commands; however, they are intractable on a matrix this large.
Is it possible to run LULU with OTUs as columns? Thanks, Peter
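For what it's worth, if the table is a plain numeric matrix (as dada2's seqtab usually is) and fits in memory, base R's t() transposes it without any dplyr machinery; a sketch, where `seqtab` stands for the dada2 output:

```r
# dada2 returns samples as rows and ASVs as columns;
# lulu expects OTUs/ASVs as rows, so transpose with base t()
otutab <- as.data.frame(t(seqtab))
```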
Hi,
this is just a very tiny request.
The installation instructions are now:
> library(devtools)
> install_github("tobiasgf/lulu")
Could you change that to:
> library(devtools)
> install_github("tobiasgf/lulu")
# load the lulu package
> library(lulu)
I added the latter step, since not everybody realizes that they still have to load the package before trying to use it.
It is simple but for non-bioinformaticians these steps are really needed.
Hi Tobias et al.
Running Lulu in the Mjolnir wrapper, I get the following error:
Error in if (relative_cooccurence >= minimum_relative_cooccurence) { :
missing value where TRUE/FALSE needed
A closed thread (#7) mentioned that this is due to an OTU with a zero total count in the dataset, but examination of the OTU table does not indicate any such instances.
I wonder if you might have any advice?
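Two checks that might narrow it down (a sketch; `otutab` and `matchlist` stand for the inputs passed to lulu):

```r
# OTUs whose total count is zero across all samples
rownames(otutab)[rowSums(otutab) == 0]

# IDs present in the match list but missing from the OTU table; these can
# also lead to NA comparisons inside lulu
setdiff(unique(c(matchlist[[1]], matchlist[[2]])), rownames(otutab))
```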
Many thanks
Martin
Hi Tobias,
I ran LULU as described in E_LULU_processing.RMD and got back empty _luluprocessed tables. I could see that the proc_min list (produced at line 87) was not empty, so I thought there might just be an issue with extracting the curated table. Indeed, I think I found what the issue is.
In line 88 of E_LULU_processing.RMD it reads:
curated_table <- proc_min$curated_OTU_table ## extracting the curated table
However, line 179 of the Functions.R script for LULU reads:
result <- list(curated_table = curation_table,
I changed line 88 of E_LULU_processing.RMD to:
curated_table <- proc_min$curated_table ## extracting the curated table
This produced an OTU table that has a smaller file size and contains fewer reads than the original OTU table, so I assume that this solved the problem. Does that seem correct to you?
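A quick way to confirm which elements the result object actually provides:

```r
# Inspect the element names of the list returned by lulu()
names(proc_min)

# then extract by the name listed there:
curated_table <- proc_min$curated_table
```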
Thanks,
Lauren
Dear lulu team,
Thanks so much for your great tool to improve reliability in metabarcoding datasets!
I came across one issue though: although I can install and run the package locally without a problem, I cannot seem to get it to install on my server environment. It keeps giving me the following error message using the devtools command:
"Error: Failed to install 'unknown package' from GitHub:
Line starting 'E ...' is malformed!"
I also tried using the githubinstall package and got this message:
"In fread(download_url, sep = "\t", header = FALSE, stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 4848: <<Puriney honfleuR "Evening, honfleuR" by Seurat>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning."
I tried the suggested quote="" but to no avail. Does this ring any bell with you, or is it just a problem with dependencies on the server environment?
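If GitHub access from the server is the culprit, installing from a manually downloaded copy sidesteps devtools' GitHub machinery entirely (a sketch; the archive name is a placeholder for whatever the downloaded file is called):

```r
# Download the repository archive in a browser, copy it to the server, then:
devtools::install_local("lulu-master.zip")

# or, after unpacking, install straight from the source directory:
install.packages("lulu-master", repos = NULL, type = "source")
```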
Hope you can help me! Thanks in advance.
Best regards
Marcel Polling
Dear Tobias,
Thanks for this package, it works nicely!
I would like to integrate it into my JAMP pipeline, but the progress report gives a lot of feedback. I would like to suppress that, but can't, since print is used to report the progress. Could you use message instead? Then it's easier to hide. =)
Also, an option to just give one message for each single % step (instead of several) would be super helpful.
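Until then, a workaround on the calling side is to capture and discard the console output (a sketch; this catches output written with print/cat, while messages and warnings pass through separately):

```r
# Capture and discard everything lulu prints to stdout
invisible(capture.output(
  res <- lulu(otutab, matchlist)
))
```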
Thank you! Keep up the good work!
Best
Vasco