Hi! It seems the 'abundance' rule maps each sample's reads against the bins generated from that same sample. Reads are not mapped against bins generated from other samples. Is this understanding correct?
If so, why not create one output directory of dereplicated bins from the whole analysis (i.e. flatten the output from reassembled_bins/{sampleID}/reassembled_bins), and map each sample's reads against that directory? Otherwise, how could you compare the abundance of a single species across samples?
I am testing this with the outputs from a relatively unsuccessful analysis (I am still re-running the pipeline with better assembly parameters). I input 48 bins, most of them very low quality: I set the minimum completeness to only 15% so that every sample would pass through the metaGEM pipeline. These 48 bins were combined into only 2 genome clusters. I do expect the results to change when I input better genomes.
Here is how I just tried implementing this using anvi'o. This R code creates the anvi'o input file (with genome names and file paths):
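library(readr)  # provides write_tsv()
# One directory per sample under ./reassembled_bins/
b_directories <- list.files("./reassembled_bins/", pattern = "HARV", full.names = TRUE)
out <- data.frame(name = character(), path = character())
for (sample in b_directories) {
  # FASTA files of this sample's reassembled bins
  samp.bins <- list.files(paste0(sample, "/reassembled_bins"), pattern = "\\.fa$", full.names = TRUE)
  if (length(samp.bins) == 0) next  # skip samples without bins
  # Prefix each bin with its sample ID so names are unique across samples
  bin_names <- paste0(basename(sample), ".", basename(samp.bins))
  samp.bin.info <- data.frame(name = bin_names, path = samp.bins)
  out <- rbind(out, samp.bin.info)
}
write_tsv(out, "./external-genomes.tsv")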
I then dereplicated the bins with anvi'o, using fastANI:
module load fastani
module load miniconda
module load anvio
conda activate anvio-7.0
anvi-dereplicate-genomes -f external-genomes.tsv -o derep_genomes/ --skip-fasta-report --program fastANI --similarity-threshold 0.85 --representative-method Qscore
The two bins selected as "representative" (one per cluster) could then be used as the reference for mapping each sample's reads. Do you think this is a sound approach?
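To make the mapping step concrete, here is a rough sketch of how the representatives could be pulled into one combined reference. It rests on assumptions I have not verified: that the dereplication output includes a tab-separated cluster report with a "representative" column, and that external-genomes.tsv has the "name" and "path" header written by the R code above. The function name build_reference and the output file name are hypothetical.

```shell
# Sketch only: build one combined FASTA from the representative genomes.
# Assumptions (not confirmed against anvi'o output):
#   $1 = cluster report, tab-separated, header has a "representative" column
#   $2 = name/path table with a header line (external-genomes.tsv)
#   $3 = combined FASTA to write
build_reference() {
    report="$1"; genomes="$2"; out="$3"
    tab="$(printf '\t')"
    : > "$out"   # start with an empty output file
    # Join the representative names to their FASTA paths
    awk -F'\t' '
        NR == FNR { if (FNR > 1) path[$1] = $2; next }
        FNR == 1  { for (i = 1; i <= NF; i++) if ($i == "representative") col = i; next }
        col && ($col in path) { print $col "\t" path[$col] }
    ' "$genomes" "$report" |
    while IFS="$tab" read -r name fasta; do
        # Prefix contig headers with the genome name so they stay unique
        awk -v n="$name" '/^>/ { print ">" n "|" substr($0, 2); next } { print }' "$fasta" >> "$out"
    done
}

# Hypothetical usage (paths are placeholders):
# build_reference derep_genomes/CLUSTER_REPORT.txt external-genomes.tsv derep_reference.fa
```

Each sample's reads could then be mapped against the combined file with whatever mapper the abundance rule already uses, and coverage summed per genome using the name prefix on each contig.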