Comments (5)

epruesse commented on September 4, 2024

The trace just says that when Emirge tried to launch Usearch, no memory was available to do so. It's quite possible that someone else launched something on your lab server that temporarily used all the available memory. To rule that out, can you check the memory usage while Emirge is running? If you hit G and 3 while in top, you should get a list of running processes sorted by memory usage in descending order (in some versions of top it's a lower-case g). Alternatively, check ps -eo rss,args | sort -nr | head to get your top memory-using processes and ps -eo rss,args | grep -iE "usearch|emirge" to get the memory usage of usearch and emirge.
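
If it's easier, a rough monitoring loop like the one below (untested sketch; the interval and log file name are arbitrary) would record how the memory use develops over a run:

# append a timestamped snapshot of emirge/usearch memory use (RSS in kB) every 10 s
while true; do
    { date; ps -eo rss,args | grep -iE "usearch|emirge" | grep -v grep; } >> emirge_mem.log
    sleep 10
done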

cmorganl commented on September 4, 2024

Here's the top output, listed in order of memory usage, on the second iteration:

PID   %MEM    VIRT    RES   CODE    DATA    SHR nMaj nDRT  %CPU COMMAND
30616 18.6 47.188g 0.046t      4 46.888g   8476    0    0  99.7 emirge.py
34502  1.9 6486924 4.800g    332 6471044   1104   23    0  2351 bwa

emirge had just finished reading the bam file from iteration 1 ([fai_load] build FASTA index). RAM usage steadily grows between iterations.

epruesse commented on September 4, 2024

I didn't read your original post closely enough. You have 50M reads from single-cell sequencing. The >2k different candidates you have in iteration 5 looked more like amplicons or a metagenome (which doesn't bode too well for your reads, I'm afraid). 50M reads is a little too much: Emirge maintains matrices in memory that scale proportionally with the number of reads and the maximum read length. But you should be able to reduce the number drastically by prefiltering them like so:

bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         outm=ssu_reads_1.fq outm2=ssu_reads_2.fq \
         minid=0.7 \
         ref=PATH_TO_SILVA_SSU_FASTA

You can add slow or vslow to further increase sensitivity. If your organisms' SSUs are very far from anything published, you could repeat the process after EMIRGE using the EMIRGE output sequences concatenated with the SILVA sequences. I doubt either is necessary, though.
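
That second pass would look roughly like this (untested sketch; "emirge_final.fasta" is just a placeholder for whatever your renamed EMIRGE output is called):

# build an extended reference from the EMIRGE output plus SILVA, then prefilter again
cat emirge_final.fasta PATH_TO_SILVA_SSU_FASTA > silva_plus_emirge.fasta
bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         outm=ssu_reads_1.fq outm2=ssu_reads_2.fq \
         minid=0.7 vslow \
         ref=silva_plus_emirge.fasta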

To be honest, for the purpose of pre-screening for contaminants, just using bbmap with scafstats=bbmap_hits.txt would already tell you something. The file will contain information on how many reads were mapped to which SILVA reference sequence. It won't yield very reliable taxonomic assignments or abundances, but it'll work.
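
That is just the mapping call from above with the stats file added, something like this (untested; same placeholder paths as before):

bbmap.sh in=original_reads_1.fq in2=original_reads_2.fq \
         minid=0.7 \
         ref=PATH_TO_SILVA_SSU_FASTA \
         scafstats=bbmap_hits.txt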

Actually, you may just want to try this tool: https://github.com/HRGV/phyloFlash
It's meant to give you a quick-and-dirty idea of what is in your sample(s) as soon as you get the sequences. It will do the pre-filtering and "ad hoc classification" with bbmap as explained above, and also run emirge and spades on your sequences.
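
The invocation is roughly the following (library name and read files are placeholders; the exact options can differ between versions, so check the built-in help):

phyloFlash.pl -lib mysample \
              -read1 original_reads_1.fq \
              -read2 original_reads_2.fq
# depending on the version, the EMIRGE step may need to be enabled explicitly
# (e.g. via an -emirge switch) in addition to the default SPAdes assembly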

cmorganl commented on September 4, 2024

Thanks for the tips, @epruesse. I'll check out bbmap (maybe khmer for digital normalization too) and feed these outputs to emirge.
PhyloFlash looks promising - any word on a future publication? I'd like to understand it better... without skimming through the source code :)
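
For the digital normalization step I'd probably try something like this on the prefiltered reads (untested sketch; script names are from khmer 2.x and the k-mer size / coverage cutoff are just the commonly used defaults):

# interleave the prefiltered pairs, then normalize to ~20x median k-mer coverage
interleave-reads.py ssu_reads_1.fq ssu_reads_2.fq -o ssu_interleaved.fq
normalize-by-median.py -p -k 20 -C 20 -M 8e9 \
    -o ssu_normalized.fq ssu_interleaved.fq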

I have one SAG with fewer than 32 million reads; nearly 25 million reads went into iteration 2 and more than half of them aligned (a lot of 16S hits!):

# reads processed: 24952624
# reads with at least one reported alignment: 13155437 (52.72%)
# reads that failed to align: 11797187 (47.28%)
Reported 44717550 alignments to 1 output stream(s)

I also have a metagenome with 50 million reads that was able to complete because there were orders of magnitude fewer alignments to the SSU database. Nearly 43 million reads were processed at iteration 2, but almost none aligned:

# reads processed: 42872661
# reads with at least one reported alignment: 482 (0.00%)
# reads that failed to align: 42872179 (100.00%)

So - drastically different proportions of 16S in these two datasets. I can now see why EMIRGE is exhausting the RAM!

I'm curious: what is the difference between the reads used in each iteration? How does EMIRGE decide what subset to use in each iteration? I was wondering if it would make sense to specify the number of reads to align at once. Is it possible to realign only those reads that initially aligned to the database (after masking) in later iterations, i.e., to home in on the reads derived from 16S sequences?

from emirge.

epruesse commented on September 4, 2024

PhyloFlash looks promising - any word on a future publication? I'd like to understand it better... without skimming through the source code :)

Hopefully yes, but I don't have a timeline yet.

# reads with at least one reported alignment: 13155437 (52.72%)
# reads that failed to align: 11797187 (47.28%)

Emirge 1 uses bowtie 1, which can't do indels. A fraction of about 50% aligned reads, like the one above, is what I would expect for a library that is essentially only 16S. Are you sure you have a SAG dataset there?

what is the difference between the reads used in each iteration?

All reads are used in each iteration; it is the candidate sequences that change. In iteration 0, the entire SILVA-based reference dataset is used as the candidate set. In subsequent iterations, the candidates are pruned, mutated, and sometimes "forked". The EM algorithm attempts to gradually increase the likelihood that the candidate set gave rise to the observed reads.
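
Schematically, one round looks like the generic mixture-model EM update below (a simplified sketch only; the actual EMIRGE model also re-estimates the candidate sequences themselves, not just their abundances):

E-step:  $P(s_j \mid r_i) \propto \pi_j \, P(r_i \mid s_j)$
M-step:  $\pi_j \leftarrow \frac{1}{N} \sum_{i=1}^{N} P(s_j \mid r_i)$

where $r_i$ are the reads, $s_j$ the current candidate sequences, $\pi_j$ their estimated abundances, and $P(r_i \mid s_j)$ comes from the read alignments.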
