Comments (5)

epruesse commented on September 4, 2024

The trace just says that when Emirge tried to launch Usearch, no memory was available to do so. It's quite possible that someone else launched something on your lab server that temporarily used all the available memory. To rule that out, can you check the memory usage while Emirge is running? If you hit G and 3 while in top, you should get a list of running processes sorted by memory usage in descending order (in some versions of top it's a lower-case g). Alternatively, check ps -eo rss,args | sort -nr | head to get your top memory-using processes and ps -eo rss,args | grep -iE "usearch|emirge" to get the memory usage of usearch and emirge.
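
If it's easier, a rough monitoring loop like the one below (untested sketch; the interval and log file name are arbitrary) would record how the memory use develops over a run:

# append a timestamped snapshot of emirge/usearch memory use (RSS in kB) every 10 s
while true; do
    { date; ps -eo rss,args | grep -iE "usearch|emirge" | grep -v grep; } >> emirge_mem.log
    sleep 10
done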

cmorganl commented on September 4, 2024

Here's the top output, listed in order of memory usage, on the second iteration:

PID   %MEM    VIRT    RES   CODE    DATA    SHR nMaj nDRT  %CPU COMMAND
30616 18.6 47.188g 0.046t      4 46.888g   8476    0    0  99.7 emirge.py
34502  1.9 6486924 4.800g    332 6471044   1104   23    0  2351 bwa

emirge had just finished reading the bam file from iteration 1 ([fai_load] build FASTA index). RAM usage steadily grows between iterations.

epruesse commented on September 4, 2024

I didn't read your original post closely enough. You have 50M reads from single-cell sequencing. The >2k different candidates you have in iteration 5 looked more like amplicons or a metagenome (which doesn't bode too well for your reads, I'm afraid). 50M reads is a little too much: Emirge maintains matrices in memory that scale proportionally with the number of reads and the maximum read length. But you should be able to reduce the number drastically by prefiltering them like so:

bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         outm=ssu_reads_1.fq outm2=ssu_reads_2.fq \
         minid=0.7 \
         ref=PATH_TO_SILVA_SSU_FASTA

You can add slow or vslow to further increase sensitivity. If your organisms' SSUs are very far from anything published, you could repeat the process after EMIRGE using the EMIRGE output sequences concatenated with the SILVA sequences. I doubt either is necessary, though.
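
That second pass would look roughly like this (untested sketch; "emirge_final.fasta" is just a placeholder for whatever your renamed EMIRGE output is called):

# build an extended reference from the EMIRGE output plus SILVA, then prefilter again
cat emirge_final.fasta PATH_TO_SILVA_SSU_FASTA > silva_plus_emirge.fasta
bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         outm=ssu_reads_1.fq outm2=ssu_reads_2.fq \
         minid=0.7 vslow \
         ref=silva_plus_emirge.fasta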

To be honest, for the purpose of pre-screening for contaminants, just using bbmap with scafstats=bbmap_hits.txt would already tell you something. The file will contain information on how many reads were mapped to which SILVA reference sequence. It won't yield very reliable taxonomic assignments or abundances, but it'll work.
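
That is just the mapping call from above with the stats file added, something like this (untested; same placeholder paths as before):

bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         minid=0.7 \
         ref=PATH_TO_SILVA_SSU_FASTA \
         scafstats=bbmap_hits.txt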

Actually, you may just want to try this tool: https://github.com/HRGV/phyloFlash
It's meant to give you a quick-and-dirty idea of what is in your sample(s) as soon as you get the sequences. It will do the pre-filtering and "ad hoc classification" with bbmap as explained above, and also run emirge and spades on your sequences.
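
The invocation is roughly the following (library name and read files are placeholders; the exact options can differ between versions, so check the built-in help):

phyloFlash.pl -lib mysample \
              -read1 original_reads_1.fq \
              -read2 original_reads_2.fq
# depending on the version, the EMIRGE step may need to be enabled explicitly
# (e.g. via an -emirge switch) in addition to the default SPAdes assembly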

cmorganl commented on September 4, 2024

Thanks for the tips, @epruesse. I'll check out bbmap (maybe khmer for digital normalization too) and feed these outputs to emirge.
PhyloFlash looks promising - any word on a future publication? I'd like to understand it better... without skimming through the source code :)
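
For the digital normalization step I'd probably try something like this on the prefiltered reads (untested sketch; script names are from khmer 2.x and the k-mer size / coverage cutoff are just the commonly used defaults):

# interleave the prefiltered pairs, then normalize to ~20x median k-mer coverage
interleave-reads.py ssu_reads_1.fq ssu_reads_2.fq -o ssu_interleaved.fq
normalize-by-median.py -p -k 20 -C 20 -M 8e9 \
    -o ssu_normalized.fq ssu_interleaved.fq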

I have one SAG with fewer than 32 million reads; nearly 25 million reads went into iteration 2 and more than half of them aligned (a lot of 16S hits!):

# reads processed: 24952624
# reads with at least one reported alignment: 13155437 (52.72%)
# reads that failed to align: 11797187 (47.28%)
Reported 44717550 alignments to 1 output stream(s)

I also have a metagenome with 50 million reads that was able to complete because there were orders of magnitude fewer alignments to the SSU database. Nearly 43 million reads were processed at iteration 2, but almost none aligned:

# reads processed: 42872661
# reads with at least one reported alignment: 482 (0.00%)
# reads that failed to align: 42872179 (100.00%)

So - drastically different proportions of 16S in these two datasets. I can now see why EMIRGE is exhausting the RAM!

I'm curious: what is the difference between the reads used in each iteration? How does EMIRGE decide what subset to use in each iteration? I was wondering if it would make sense to specify the number of reads to align at once. Is it possible to realign only those reads that initially aligned to the database (after masking) in later iterations, i.e., to home in on the reads derived from 16S sequences?

from emirge.

epruesse commented on September 4, 2024

PhyloFlash looks promising - any word on a future publication? I'd like to understand it better... without skimming through the source code :)

Hopefully yes, but I don't have a timeline yet.

# reads with at least one reported alignment: 13155437 (52.72%)
# reads that failed to align: 11797187 (47.28%)

Emirge 1 uses bowtie 1, which can't do indels. A fraction of about 50% aligned reads, like the one above, is what I would expect for a library that is essentially only 16S. Are you sure you have a SAG dataset there?

what is the difference between the reads used in each iteration?

All reads are used in each iteration; it is the candidate sequences that change. In iteration 0, the entire SILVA-based reference dataset is used as the candidate set. In subsequent iterations, the candidates are pruned, mutated, and sometimes "forked". The EM algorithm attempts to gradually increase the likelihood that the candidate set gave rise to the observed reads.
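
Schematically, one round looks like the generic mixture-model EM update below (a simplified sketch only; the actual EMIRGE model also re-estimates the candidate sequences themselves, not just their abundances):

E-step:  $P(s_j \mid r_i) \propto \pi_j \, P(r_i \mid s_j)$
M-step:  $\pi_j \leftarrow \frac{1}{N} \sum_{i=1}^{N} P(s_j \mid r_i)$

where $r_i$ are the reads, $s_j$ the current candidate sequences, $\pi_j$ their estimated abundances, and $P(r_i \mid s_j)$ comes from the read alignments.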
