Comments (9)

frederic-mahe avatar frederic-mahe commented on August 19, 2024

Hi @matiz,

regarding taxonomic assignment, you can indeed use vsearch (see vsearch/issues/73).

Swarm does not output a biom file yet. I designed it to be a clustering tool, not a fully fledged pipeline. To help users, I described on that page a way to produce a table of amplicon occurrences or OTU occurrences. I hope this will be useful to you.
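A minimal sketch of how such an OTU occurrence table could be built in Python. It assumes a swarm output file with one OTU per line and space-separated amplicon identifiers, with the first identifier taken as the seed. The `sample_counts` mapping is a hypothetical structure that you would build from your own per-sample data, and this is not necessarily the procedure described on the page mentioned above.

```python
from collections import defaultdict


def build_otu_table(swarm_lines, sample_counts):
    """Sum per-sample amplicon abundances for each OTU.

    swarm_lines: iterable of lines, one OTU per line, amplicon ids
        separated by spaces (the first id is taken as the seed).
    sample_counts: dict mapping amplicon id -> {sample name: count}
        (hypothetical structure, built beforehand from your own data).
    Returns a dict mapping seed id -> {sample name: summed count}.
    """
    table = {}
    for line in swarm_lines:
        amplicons = line.split()
        if not amplicons:
            continue  # skip empty lines
        totals = defaultdict(int)
        for amplicon in amplicons:
            # Amplicons missing from sample_counts contribute nothing.
            for sample, count in sample_counts.get(amplicon, {}).items():
                totals[sample] += count
        table[amplicons[0]] = dict(totals)
    return table
```

The returned dictionary can then be written out as a tab-separated table, with one row per OTU and one column per sample.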

from swarm.

colinbrislawn avatar colinbrislawn commented on August 19, 2024

Hello @frederic-mahe, @torognes

I understand that Swarm is a clustering tool, first and foremost. I also understand the importance of avoiding scope creep and support your decision to implement tools in vsearch instead of swarm: #48

Part of what makes Swarm so much better than other clustering tools is that it fundamentally changes the definition of an OTU. This new definition may need new approaches and databases in many downstream steps. The qiime pipeline now supports swarm clustering, but it mangles taxonomic assignment because it still defines OTUs as '97% similar'.

My production environment needs a 'fully fledged pipeline', so I run qiime, usearch, and phyloseq. How do you envision swarm/vsearch growing in the future?

frederic-mahe avatar frederic-mahe commented on August 19, 2024

Hi @colinbrislawn,

sorry for the late reply.

If I understand correctly, QIIME performs taxonomic assignment on a database pre-clustered at 97%? You are searching for an alternative method to perform the taxonomic assignment. For my own analyses, I use vsearch (--usearch_global) against a reference database of sequences trimmed using the same primers as those used for the amplicons. That method works well for Eukaryotes, as there are only approximately 100,000 18S references. Using the same approach for 16S amplicons might not be possible, as reference datasets are much larger for bacteria (roughly 10^6 reference sequences).
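As a rough illustration of that search, the sketch below builds a vsearch command line in Python. The file names, the 0.5 identity threshold and the helper name `usearch_global_command` are placeholders chosen for this example, not values taken from the comment above. The options used (`--usearch_global`, `--db`, `--id`, `--maxaccepts`, `--maxrejects`, `--top_hits_only`, `--blast6out`, `--threads`) are standard vsearch options.

```python
def usearch_global_command(queries, database, output, identity=0.5, threads=1):
    """Build the argument list for a vsearch global-alignment search.

    All file names and the identity threshold are placeholders.
    """
    return [
        "vsearch",
        "--usearch_global", queries,  # OTU representatives (fasta)
        "--db", database,             # primer-trimmed references (fasta)
        "--id", str(identity),        # minimum identity to report a hit
        "--maxaccepts", "0",          # with maxrejects 0: search the whole database
        "--maxrejects", "0",
        "--top_hits_only",            # keep only the best hits, ties included
        "--blast6out", output,        # tabular, blast-like output
        "--threads", str(threads),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` once vsearch is installed. Searching the whole database for every query is slow on large 16S references, which is the limitation raised above.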

Asking the QIIME developers to implement the possibility to target a non-clustered reference database might be a temporary solution. Regarding vsearch, future versions could offer faster search, but there is nothing certain about that.

colinbrislawn avatar colinbrislawn commented on August 19, 2024

> If I understand correctly, QIIME performs taxonomic assignment on a database pre-clustered at 97%? You are searching for an alternative method to perform the taxonomic assignment.

Exactly.

The database is pre-clustered using the same percent identity as used when picking OTUs. So OTUs made by swarm need a different approach which captures their variable resolution.

I'm not worried about the size of the database; I'm lucky enough to have access to ludicrous computational power. I think that greengenes and silva both ship unclustered databases which I may be able to drop into my pipeline.

I'm more worried about a method which will accurately characterize swarms with a variable radius. You mentioned using --usearch_global followed by finding the last common ancestor. Does this work well for 'plateau' swarms from a slowly evolving marker?

Thank you for your discussion and excellent software.

pbuttigieg avatar pbuttigieg commented on August 19, 2024

Greetings all,
Thanks for posting and discussing this!
It seems the issue has transitioned to one of taxonomic classification.

> I'm more worried about a method which will accurately characterize swarms with a variable radius. You mentioned using --usearch_global followed by finding the last common ancestor. Does this work well for 'plateau' swarms from a slowly evolving marker?

I'm also wondering how to best classify a given swarm, but perhaps using a per-swarm representative sequence. The seed sequence can be used, but is there a way to pull out the "central" and/or "extreme" (i.e. the sequences with the greatest number of edges between them) sequences of a swarm?

frederic-mahe avatar frederic-mahe commented on August 19, 2024

Hi @colinbrislawn, hi @pbuttigieg,

First, I am very sorry I did not answer your questions earlier (and sorry for the long answer).

To answer @pbuttigieg's question: if you use the -w option, swarm will output OTU representatives in fasta format (that's the central sequence). The most extreme sequence of an OTU is always the last to be added, so you just have to select the last sequence of each OTU in the "swarms" file to get the extremum. Note that this is not true if you use the fastidious option: in that case, grafted sequences are added last.
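A minimal sketch of that selection in Python, assuming the "swarms" file has one OTU per line with space-separated amplicon identifiers and the seed listed first. The function name is made up for this example, and as noted above the last entry is only the extremum for runs without the fastidious option.

```python
def seed_and_extremum(swarm_lines):
    """Return a list of (seed id, last-added amplicon id), one per OTU.

    swarm_lines: iterable of lines from a swarm output file,
        one OTU per line, identifiers separated by spaces.
    """
    pairs = []
    for line in swarm_lines:
        amplicons = line.split()
        if amplicons:
            # First entry: seed; last entry: last amplicon added.
            pairs.append((amplicons[0], amplicons[-1]))
    return pairs
```

For an OTU that contains a single amplicon, the seed and the extremum are the same identifier.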

Now, regarding taxonomic assignment: I can only speak from my own experience with 18S sequences, though I am confident that most of my observations can be transposed to 16S datasets. For most OTUs produced by swarm, using the representative sequence (i.e. the most abundant sequence) for downstream analyses is safe. The approach I use for my own analyses (assign to the best hit, and to the last common ancestor if there is a tie) has yielded good results so far. The approach is only limited by the variability of the marker and by the quality and coverage of the reference database.
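The "best hit, else last common ancestor" rule can be sketched as follows. This is an illustration only, not the code of the script mentioned in this thread. It assumes each hit is an (identity, taxonomy) pair and that taxonomic ranks are joined by a "|" separator, which is a hypothetical choice for the example.

```python
def last_common_ancestor(taxonomies, separator="|"):
    """Return the longest shared leading part of several taxonomic paths."""
    paths = [taxonomy.split(separator) for taxonomy in taxonomies]
    common = []
    for ranks in zip(*paths):
        if len(set(ranks)) != 1:
            break  # first rank where the tied hits disagree
        common.append(ranks[0])
    return separator.join(common)


def assign(hits, separator="|"):
    """Assign a query to its best hit, or to the LCA of tied best hits.

    hits: list of (identity, taxonomy) tuples for one query.
    Returns None when there are no hits.
    """
    if not hits:
        return None
    best = max(identity for identity, _ in hits)
    tied = [taxonomy for identity, taxonomy in hits if identity == best]
    return last_common_ancestor(tied, separator)
```

With a single best hit the full taxonomic path is returned. With several tied best hits the path is truncated at the first rank where they disagree.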

If you want to have a look at my taxonomic assignment script, I pushed it to GitHub under the name stampa.

When the marker is not variable enough, "plateau" OTUs can appear. For these OTUs, it could be interesting to retain more than one representative sequence. Since we've included the breaking phase directly in the growth phase (swarm 1.2.20 and later versions), plateaus are far less frequent (at least in my datasets). Now the question is: how do you spot a plateau OTU? The stats file gives some clues: plateau OTUs have high ratios of OTU mass to seed mass. Another way to identify plateau OTUs is to visualize the internal structure of OTUs.
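A sketch of how that ratio could be computed from the stats file. The column layout is an assumption to check against the documentation of your swarm version: tab-separated, with total OTU abundance in column 2, seed identifier in column 3 and seed abundance in column 4. No cutoff is proposed here, since what counts as a "high" ratio depends on the dataset.

```python
def mass_to_seed_ratios(stats_lines):
    """Return (seed id, OTU mass / seed mass) pairs, highest ratio first.

    stats_lines: iterable of tab-separated lines; columns 2, 3 and 4
        are assumed to hold OTU mass, seed id and seed mass.
    """
    ratios = []
    for line in stats_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        mass = int(fields[1])
        seed_id = fields[2]
        seed_mass = int(fields[3])
        ratios.append((seed_id, mass / seed_mass))
    return sorted(ratios, key=lambda item: item[1], reverse=True)
```

OTUs at the top of the returned list are the first candidates to inspect visually.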

If you use the -i output_file option, you can collect the internal structure of each OTU, as explored by swarm during the growth phase. Then, you can use the companion script graph_plot.py to produce a visualization. The script requires the module igraph and python 2.7+ to run:

python graph_plot.py -s data.swarms -i data.struct -o 5 -d 10

The option -o INTEGER targets the nth OTU in the dataset; the option -d INTEGER is a threshold: amplicons with that abundance or less will be discarded from the plot. This is useful when plotting large OTUs (the rendering time is too long for OTUs with more than a few thousand unique amplicons).

With these plots, you can identify situations such as this one. That OTU contains two similarly abundant sequences, separated by only one difference (circles are amplicons, edges represent 1 difference, edge length is meaningless, color is white for rare amplicons and blue for abundant ones, circle size also reflects abundance, and abundance values are indicated for amplicons with an abundance equal to or greater than 10). The two abundant forms have different ecological patterns (i.e. they do not co-occur in our samples) and may represent different populations. In that case, both dominant sequences are assigned to the same taxon, but if our reference database were to get more precise in the future, it would be interesting to consider two representatives for that OTU.
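Candidate OTUs of this kind could also be screened without plotting, as in the sketch below. It rests on several assumptions: the internal structure file is tab-separated with the two amplicon identifiers in columns 1 and 2, the number of differences in column 3 and the OTU number in column 4, and abundances are stored as an "_N" suffix on each identifier. The `min_ratio` and `min_abundance` thresholds are arbitrary values picked for illustration.

```python
def abundance(amplicon_id, separator="_"):
    """Extract the abundance from an identifier such as 'seq1_120'."""
    return int(amplicon_id.rsplit(separator, 1)[1])


def codominant_pairs(struct_lines, min_ratio=0.5, min_abundance=10):
    """List (OTU number, amplicon A, amplicon B) for linked amplicons
    that are one difference apart and have similar abundances.

    struct_lines: iterable of tab-separated lines (layout assumed above).
    """
    pairs = []
    for line in struct_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        parent, child = fields[0], fields[1]
        differences, otu = int(fields[2]), int(fields[3])
        low, high = sorted((abundance(parent), abundance(child)))
        if differences == 1 and low >= min_abundance and low / high >= min_ratio:
            pairs.append((otu, parent, child))
    return pairs
```

Pairs reported this way are only candidates. Whether the two forms have different ecological patterns still has to be checked against the sample data.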

I'll be very interested to see if you can find interesting patterns like this in your own datasets.

dmvvliet avatar dmvvliet commented on August 19, 2024

Hi @frederic-mahe, @torognes,

Firstly, thanks for developing this tool and for discussing it here. I have used swarm for the clustering of 16S data and have classified the seed sequences with SINA and the Silva database. This seems to work pretty well for me, as swarms tend to stay under 7 generations with d=1 and the fastidious option. I do, however, have some single swarms reaching a maximum generation as high as 29. I wanted to look further into this but encountered the problem that the swarms file is not updated with grafting info, as mentioned a couple of times here already. I therefore looked into your suggestion of exploring the struct file with the graph_plot.py script. I saw that this python script purposefully excluded grafted sequences, so I edited it to make dropping fastidious sequences an option rather than a standard feature. However, the fastidiously added sequences have no edges connecting them to the main swarm. I understand from another discussion here that the struct file is still to be updated, and that proper analysis of swarms created with d=1 and the fastidious option is still in progress. Is this correct?

As an easier-to-evaluate setting I also tried clustering with d=2, which gives me roughly the same number of swarms. For the user community, would this not be a more easily evaluated alternative to d=1 with the fastidious option?

thanks, Daan

frederic-mahe avatar frederic-mahe commented on August 19, 2024

Hi @dmvvliet,

You are right, the structure file is not updated yet. However, I don't think this is important for what you are trying to do.

The fastidious option grafts low-abundance amplicons onto more abundant amplicons (so, close to the core of the cluster). Therefore, it should not significantly increase the max radius of a given cluster. Large radii are most of the time due to "pseudopods" (sudden extensions in one narrow region of the amplicon-space). My advice is to compare runs with and without the fastidious option to check that.

Furthermore, a radius of 29 does not mean that there are 29 accumulated differences between the seed and the outermost amplicon. If you perform a pairwise alignment of the outermost amplicon against the seed, you will probably observe a smaller divergence value. This is rather counter-intuitive, and I am still exploring that property of swarm results.
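A toy illustration of how the number of steps along a chain can exceed the direct divergence. The four-nucleotide sequences are invented for the example, and this shows only one way the effect can arise (missing intermediates forcing a detour), not an explanation confirmed in this thread. The edit distance is a plain Levenshtein distance, used here as a simple stand-in for a pairwise alignment.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


# Hypothetical chain from a seed to an outermost amplicon, one
# difference per step. The shorter routes (via CAAA or ACAA) are
# assumed to be absent from the dataset.
path = ["AAAA", "AAAT", "CAAT", "CCAT", "CCAA"]
steps = len(path) - 1                        # 4 steps along the chain
direct = levenshtein(path[0], path[-1])      # 2 differences end to end
print(steps, direct)
```

Here the outermost amplicon is reached in four steps, yet it differs from the seed at only two positions.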

Finally, with the increase of sequencing depth, we can expect to see swarms with increasing max radii: sequencing errors accumulate and saturate the layers of microvariants around the seed. Consequently, a seed with a large abundance value is also likely to have a large radius.

Best,

giriarteS avatar giriarteS commented on August 19, 2024

I was assigning taxonomy to my ITS2 representative sequences with the qiime script assign_taxonomy.py (rdp and blast options) using the UNITE database as reference (unfortunately UNITE doesn't contain oomycetes), but all my plant host ITS sequences were labeled as ascomycota. (When clustering with UPARSE I also got many plant host OTUs, but with swarm only the largest OTU corresponded to my plant host.) Now I assign taxonomy with blast using a local database (http://www.emerencia.org/fungalitspipeline.html). Then I use the python script fhitings.py (http://onlinelibrary.wiley.com/doi/10.1002/jobm.201200507/abstract) to classify the blast output using the lowest common ancestor (LCA) method and produce a single identification for each OTU, with taxonomic ranks assigned from species through kingdom when possible for each sequence, based on the Index Fungorum database.
