songweizhi / markermag

Linking MAGs with 16S rRNA marker genes

License: GNU Affero General Public License v3.0

Python 95.15% Perl 2.09% Shell 2.76%

markermag's Introduction

MarkerMAG: linking MAGs with 16S rRNA marker genes using paired-end short reads

Publication

Song WZ, Zhang S, Thomas T* (2022) MarkerMAG: linking metagenome-assembled genomes (MAGs) with 16S rRNA marker genes using paired-end short reads, Bioinformatics. https://doi.org/10.1093/bioinformatics/btac398
Contact: Dr. Weizhi Song ([email protected]), Prof. Torsten Thomas ([email protected])
Center for Marine Science & Innovation, University of New South Wales, Sydney, Australia

Updates

2022-05-08 - MarkerMAG is now available on Bioconda; please refer to "How to install" for details.
2022-03-12 - A demo dataset (together with the command to run it) is now provided. You can use it to check whether MarkerMAG has been installed successfully on your system.

MarkerMAG modules

Main module
- link: linking MAGs with 16S rRNA marker genes
Supplementary modules
- rename_reads: rename paired reads (manual)
- matam_16s: assemble 16S rRNA genes with Matam (manual)
- barrnap_16s: identify 16S rRNA genes from genomes/MAGs with Barrnap (manual)

How to install

MarkerMAG is implemented in Python 3. It has been tested on Linux and macOS, but NOT on Windows.

A Conda package that automatically installs MarkerMAG's third-party dependencies (except Usearch ⚠️) is now available. Please note that you'll need to install Usearch yourself, as it is not available in Conda due to licensing restrictions.

# install with 
conda create -n MarkerMAG_env -c bioconda MarkerMAG

# To activate the environment    
conda activate MarkerMAG_env
# MarkerMAG is ready for running now, type "MarkerMAG -h" for help

# To deactivate the environment
conda deactivate

It can also be installed with pip. Software dependencies need to be in your system path in this case. Dependencies for the link module include BLAST+, Barrnap, seqtk, Bowtie2, Samtools, HMMER, metaSPAdes and Usearch. Dependencies for the supplementary modules are provided in their corresponding manual page.
```
# install with 
pip3 install MarkerMAG
  
# upgrade with 
pip3 install --upgrade MarkerMAG
```
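Because a pip installation expects the dependencies listed above to be on the system path, it can help to check for them before the first run. The following is a minimal sketch, not part of MarkerMAG: the executable names in `LINK_DEPENDENCIES` are assumptions about how each tool is typically installed (for example `hmmsearch` for HMMER), so adjust them to match your system.

```python
import shutil

# Tool name -> executable expected on PATH.
# These executable names are assumptions; adjust them to your installation.
LINK_DEPENDENCIES = {
    "BLAST+": "blastn",
    "Barrnap": "barrnap",
    "seqtk": "seqtk",
    "Bowtie2": "bowtie2",
    "Samtools": "samtools",
    "HMMER": "hmmsearch",
    "metaSPAdes": "spades.py",
    "Usearch": "usearch",
}


def find_missing(dependencies, which=shutil.which):
    """Return the tool names whose executable cannot be found on PATH."""
    return [tool for tool, exe in dependencies.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing(LINK_DEPENDENCIES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All link-module dependencies were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what you want.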
Here are some example commands for UNSW Katana users.
⚠️ If you clone the repository directly off GitHub you might end up with a version that is still under development.

How to run

MarkerMAG’s input consists of
1. A set of user-provided MAGs
2. A set of 16S rRNA gene sequences (either user-provided or generated with the matam_16s module)
3. Paired-end reads, quality-filtered and in FASTA format (no quality scores)
⚠️ MarkerMAG is designed to work with paired short-read data (i.e. Illumina). It assumes that the IDs of the two reads in a pair have the format XXXX.1 and XXXX.2, differing only in the last character. You can rename your reads with MarkerMAG's rename_reads module (manual).
Although you can use your preferred tool to reconstruct 16S rRNA gene sequences from the metagenomic dataset, MarkerMAG does have a supplementary module (matam_16s) to reconstruct 16S rRNA genes. Please refer to the manual here if you want to give it a go.
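The read requirements above (FASTA format, IDs ending in .1 and .2) can be illustrated with a small conversion sketch. This is not the implementation of the rename_reads module; it is a hypothetical helper that assumes four-line FASTQ records, reads that have already been quality-filtered, and R1/R2 files that list mates in the same order. The function names and the `<prefix>_<n>` naming scheme are illustrative choices.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_to_fasta(fastq_handle, fasta_handle, prefix, mate):
    """Write 4-line FASTQ records as FASTA, naming reads <prefix>_<n>.<mate>.

    Returns the number of reads written. Quality scores are dropped.
    """
    count = 0
    for header in iter(fastq_handle.readline, ""):
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line
        fastq_handle.readline()  # quality line (discarded)
        count += 1
        fasta_handle.write(f">{prefix}_{count}.{mate}\n{seq}\n")
    return count


# Small in-memory demonstration with two made-up reads.
demo_fastq = io.StringIO("@readA/1\nACGT\n+\nIIII\n@readB/1\nTTGA\n+\nIIII\n")
demo_fasta = io.StringIO()
n_reads = fastq_to_fasta(demo_fastq, demo_fasta, "Sample", 1)
print(demo_fasta.getvalue(), end="")
```

For real files, `open_text("sample_R1.fastq.gz")` can be passed as the input handle and the function called once per mate file (with `mate=1` and `mate=2`), so that paired reads receive IDs that differ only in the last character.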

Link 16S rRNA gene sequences with MAGs (demo dataset):

MarkerMAG link -p Demo -r1 demo_R1.fasta -r2 demo_R2.fasta -marker demo_16S.fasta -mag demo_MAGs -x fa -t 12

Output files

Summary of identified linkages at genome level:

Marker	MAG	Linkage	Round
matam_16S_7	MAG_6	181	Rd1
matam_16S_12	MAG_9	102	Rd1
matam_16S_6	MAG_59	55	Rd2
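The genome-level summary is a tab-separated table, so it can be loaded with standard tools. The sketch below parses the example rows shown above; the function name and the dictionary keys are illustrative, and the column names are taken from the example header rather than from a documented file specification.

```python
import csv
import io

# Example rows copied from the genome-level summary above.
GENOME_LEVEL_TSV = (
    "Marker\tMAG\tLinkage\tRound\n"
    "matam_16S_7\tMAG_6\t181\tRd1\n"
    "matam_16S_12\tMAG_9\t102\tRd1\n"
    "matam_16S_6\tMAG_59\t55\tRd2\n"
)


def read_genome_level_links(handle):
    """Parse the tab-separated genome-level summary into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "marker": row["Marker"],
            "mag": row["MAG"],
            "linkage": int(row["Linkage"]),  # number of linkages
            "round": row["Round"],           # Rd1 or Rd2
        }
        for row in reader
    ]


links = read_genome_level_links(io.StringIO(GENOME_LEVEL_TSV))

# Group the linked 16S rRNA genes by MAG.
mag_to_markers = {}
for link in links:
    mag_to_markers.setdefault(link["mag"], []).append(link["marker"])

for mag, markers in mag_to_markers.items():
    print(mag, markers)
```

To read an actual output file, replace the `io.StringIO(...)` argument with an open file handle.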

Summary of identified linkages at contig level (with figure):

Marker___MAG (linkages)	Contig	Round_1	Round_2
matam_16S_7___MAG_6(181)	Contig_1799	176	0
matam_16S_7___MAG_6(181)	Contig_1044	5	0
matam_16S_12___MAG_9(102)	Contig_840	102	0
matam_16S_6___MAG_59(39)	Contig_171	0	55
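In the contig-level table, the first column packs the marker ID, the MAG ID and a linkage count into one label. A minimal sketch for splitting that label and summing the per-contig counts for each marker and MAG pair is shown here; the regular expression is an assumption based only on the example rows above, and the sketch reports the summed Round_1 and Round_2 values rather than the number in parentheses.

```python
import csv
import io
import re

# Example rows copied from the contig-level summary above.
CONTIG_LEVEL_TSV = (
    "Marker___MAG (linkages)\tContig\tRound_1\tRound_2\n"
    "matam_16S_7___MAG_6(181)\tContig_1799\t176\t0\n"
    "matam_16S_7___MAG_6(181)\tContig_1044\t5\t0\n"
    "matam_16S_12___MAG_9(102)\tContig_840\t102\t0\n"
    "matam_16S_6___MAG_59(39)\tContig_171\t0\t55\n"
)

# Label layout assumed from the examples: <marker>___<MAG>(<count>)
PAIR_PATTERN = re.compile(r"^(?P<marker>.+)___(?P<mag>.+)\((?P<total>\d+)\)$")


def sum_links_per_pair(handle):
    """Sum Round_1 + Round_2 linkages over contigs for each (marker, MAG)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    totals = {}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        pair_label, _contig, rd1, rd2 = row
        match = PAIR_PATTERN.match(pair_label)
        if match is None:
            raise ValueError(f"unrecognised pair label: {pair_label!r}")
        key = (match["marker"], match["mag"])
        totals[key] = totals.get(key, 0) + int(rd1) + int(rd2)
    return totals


totals = sum_links_per_pair(io.StringIO(CONTIG_LEVEL_TSV))
for (marker, mag), n_links in totals.items():
    print(marker, mag, n_links)
```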

Copy number of linked 16S rRNA genes.
Visualization of individual linkage.

MarkerMAG supports the visualization of identified linkages (needs Tablet). Output files for visualization (example) can be found in the [Prefix]_linkage_visualization_rd1/2 folders. You can visualize how the linking reads are aligned to MAG contig and 16S rRNA gene by double-clicking the corresponding ".tablet" file. Fifty Ns are added between the linked MAG contig and 16S rRNA gene.
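The fifty-N spacer mentioned above determines where each sequence sits in the combined reference that Tablet displays. The following sketch shows the arithmetic only; the order of the two sequences (contig first, 16S rRNA gene second) and the function names are assumptions for illustration, not a description of how MarkerMAG builds the file.

```python
N_SPACER_LENGTH = 50  # number of Ns between contig and 16S rRNA gene


def join_with_spacer(contig_seq, marker_seq, spacer_length=N_SPACER_LENGTH):
    """Concatenate two sequences with a run of Ns between them."""
    return contig_seq + "N" * spacer_length + marker_seq


def segment_coordinates(contig_len, marker_len, spacer_length=N_SPACER_LENGTH):
    """Return 1-based inclusive (start, end) for contig, spacer and marker."""
    contig = (1, contig_len)
    spacer = (contig_len + 1, contig_len + spacer_length)
    marker = (contig_len + spacer_length + 1,
              contig_len + spacer_length + marker_len)
    return {"contig": contig, "spacer": spacer, "marker": marker}


# Toy sequences, far shorter than real contigs or 16S rRNA genes.
toy_contig = "ACGTACGT"
toy_marker = "TTGACA"
combined = join_with_spacer(toy_contig, toy_marker)
coords = segment_coordinates(len(toy_contig), len(toy_marker))
print(len(combined), coords)
```

Reads whose alignments fall on either side of the spacer region are the ones that visually connect the contig to the 16S rRNA gene.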

*If Tablet reports that the format of the input files cannot be understood, please refer to here for a potential solution.

markermag's Issues

Mini-assembly not working

Hi,

Thanks for the nice tool. When I try to run this command:

MarkerMAG link -p test -marker reconstructed_16.fa -mag MAGs -x fa -r1 ETHSEQ0000005220.1.fq -r2 ETHSEQ0000005220.2.fq -t 8 -o output_test

I have the following error:

[2022-06-24 17:23:26] parameters for linking
 + mismatch:	2%
 + min_M_len:	45bp
 + min_M_pct:	35%
 + min_link_num_gnm:	9
 + min_link_num_ctg:	3
 + rd2_end_seq_len:	1000bp
 + max_short_cigar_pct:	75,85
[2022-06-24 17:23:26] parameters for estimating copy number
 + MAG_cov_subsample_pct:	25%
 + min_insert_size_16s:	-1000bp
 + ignore_ends_len_16s:	150bp
 + ignore_lowest_pct:	25%
 + ignore_highest_pct:	25%
 + both_pair_mapped:	False
[2022-06-24 17:23:26] Rd1: identifying 16S rRNA genes in input MAGs with barrnap
[2022-06-24 17:23:32] Rd1: identify 16S rRNA genes in input MAGs finished
[2022-06-24 17:23:32] Rd1: removing 16S sequences at the end of MAG contigs
[2022-06-24 17:23:32] Rd1: remove 16S sequences at the end of MAG contigs finished
[2022-06-24 17:23:32] Rd1: quality control provided 16S rRNA gene sequences to:
[2022-06-24 17:23:32] Rd1: remove non-16S sequences (if any)
[2022-06-24 17:23:32] Rd1: cluster at 99% identity and keep only the longest one in each cluster
[2022-06-24 17:23:37] Rd1: qualified 16S rRNA gene sequences exported to: reconstructed_16_polished_min1200bp_c99.fa
[2022-06-24 17:23:38] Rd1: mapping input reads to marker genes with 8 cores (be patient!)
11455318 reads; of these:
  11455318 (100.00%) were unpaired; of these:
    11434701 (99.82%) aligned 0 times
    1864 (0.02%) aligned exactly 1 time
    18753 (0.16%) aligned >1 times
0.18% overall alignment rate
[2022-06-24 17:24:00] Rd1: sorting test_input_reads_to_16S.sam
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-06-24 17:24:02] Rd1: calculating the number of lines per subset
[2022-06-24 17:24:02] Rd1: splitting sam file
[2022-06-24 17:24:04] Rd1: analysing mappping results with 8 threads
[2022-06-24 17:24:04] Rd1: removing splitted subsets from disk
[2022-06-24 17:24:04] Rd1: reading filtered alignments into dict
[2022-06-24 17:24:04] Rd1: extracting sequences of reads matched to 16S
[2022-06-24 17:24:07] Rd1: mapping extracted reads to input genomes
Unable to read file magic number
Unable to read file magic number
0 reads
0.00% overall alignment rate
[2022-06-24 17:24:31] Rd1: analysing mappping results
[2022-06-24 17:24:31] Rd1: processed 0.0k
[2022-06-24 17:24:31] Rd1: parsing MappingRecord dict to get linkages
[2022-06-24 17:24:31] Rd1: calculating pairwise 16S rRNA gene identities
[2022-06-24 17:24:33] Rd1: filtering linkages iteratively
[2022-06-24 17:24:33] Rd1: extracting linking reads for visualization
[2022-06-24 17:24:35] Rd1: visualizing 0 rd1 linkages with 8 threads
[2022-06-24 17:24:36] Rd2: get unlinked marker genes and genomes
[2022-06-24 17:24:36] Rd2: mapping input reads to the ends of contigs from unlinked genomes
11455318 reads; of these:
  11455318 (100.00%) were unpaired; of these:
    10775989 (94.07%) aligned 0 times
    647843 (5.66%) aligned exactly 1 time
    31486 (0.27%) aligned >1 times
5.93% overall alignment rate
[2022-06-24 17:25:18] Rd2: sorting mappping results
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-06-24 17:25:21] Rd2: calculating the number of lines per subset
[2022-06-24 17:25:21] Rd2: splitting sam file
[2022-06-24 17:25:23] Rd2: reading in mappping results with 8 threads
[2022-06-24 17:25:24] Rd2: removing splitted subsets from disk
[2022-06-24 17:25:26] Rd2: running SPAdes on extracted reads
[2022-06-24 17:25:27] Mini-assembly not found! will report 1st round linkages only!
Traceback (most recent call last):
  File "/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/EAN/NCCR_META_OMICS_EAN/data/raw/software/miniconda3/bin/MarkerMAG", line 134, in <module>
    link_16s.link_16s(args, config_dict)
  File "/nfs/nas22/fs2202/biol_micro_sunagawa/Projects/EAN/NCCR_META_OMICS_EAN/data/raw/software/miniconda3/lib/python3.9/site-packages/MarkerMAG/link_16s.py", line 4581, in link_16s
    for each_ctg_level_link in open(stats_GapFilling_ctg):
FileNotFoundError: [Errno 2] No such file or directory: 'output_test/test_rd2_wd/stats_GapFilling_ctg.txt'

And in the log file (test.log) I see:

[2022-06-24 17:25:26] Rd2: running SPAdes on extracted reads
[2022-06-24 17:25:26] spades.py --only-assembler -s output_test/test_rd2_wd/rd2_read_to_extract_flanking_both_R12_up.fa -o output_test/test_rd2_wd/mini_assembly_SPAdes_wd -t 8 -k 75,99,123 -m 1024 > output_test/test_rd2_wd/SPAdes_stdout.txt
[2022-06-24 17:25:27] Mini-assembly not found! will report 1st round linkages only!

Where is matam_assembly.py?

Hello dear developers:

Thank you for developing a tool that is very helpful for our research!

I have successfully installed the software using conda, but when configuring the database for matam_16s, I can't find matam_db_preprocessing.py or matam_assembly.py. Maybe I should provide the absolute path to these scripts, but I don't know where they are located. Any help would be appreciated. Thank you in advance!

Best,
xnw

sorted_sam16s error

Hello,

Thank you for developing MarkerMAG, the tool looks really promising!

I'm having the following error every time I try to run MarkerMAG link.

Traceback (most recent call last):
  File "/home/cardena/miniconda3/bin/MarkerMAG", line 126, in <module>
    link_16s.link_16s(args)
  File "/home/cardena/miniconda3/lib/python3.7/site-packages/MarkerMAG/link_16s.py", line 2703, in link_16s
    reads_vs_16s_sam = args['sorted_sam16s']
KeyError: 'sorted_sam16s'

The command I'm using is:

MarkerMAG link -p sponges_markerMAG_output -r1 sponge_R1.fasta -r2 sponge_R2.fasta -marker interesing_bacteria.fasta -mag ./binning/chosen_bins -x fa -t 64 -tmp -force

I tried with and without renaming the fasta files using "MarkerMAG rename_reads", but I always get the same error.
Am I missing something?

Thanks for clarification!
Anny

Syntax error when running demo

Hi there,

I recently installed MarkerMAG on our department grid using mamba, and when running the command to test on the demo dataset, this error was thrown:

(MarkerMAG_env) jones@my-mgrid2-2:/proj/mykopat-ncommeg/MarkerMAG_demo_data$ MarkerMAG link -p Demo -r1 demo_R1.fasta -r2 demo_R2.fasta -marker demo_16S.fasta -mag demo_MAGs -x fa -t 16
Traceback (most recent call last):
  File "/nethomes/jones/mambaforge/envs/MarkerMAG_env/bin/MarkerMAG", line 64, in <module>
    from MarkerMAG import link_16s
  File "/nfs4/my-mgrid-s8/nethomes/jones/mambaforge/envs/MarkerMAG_env/lib/python2.7/site-packages/MarkerMAG/link_16s.py", line 4676
    arg_for_cn = {**config_dict, **args}
                  ^
SyntaxError: invalid syntax

Not sure if this is something I can fix myself, do you have any suggestions?

many thanks,
chris
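A note on the traceback above: the module is being loaded from a python2.7 site-packages directory, and dictionary unpacking inside a literal (`{**config_dict, **args}`) is only valid syntax from Python 3.5 onwards (PEP 448). The sketch below checks the interpreter version and shows the merge the failing line performs, written in a form that also parses on older interpreters. The dictionary contents are made-up placeholders, and this is a diagnostic aid rather than a confirmed fix for this environment.

```python
import sys


def supports_dict_unpacking(version_info=sys.version_info):
    """Return True if {**a, **b} is valid syntax (Python 3.5+, PEP 448)."""
    return tuple(version_info[:2]) >= (3, 5)


# Placeholder dictionaries; the real contents in MarkerMAG are not shown
# in the traceback.
config_dict = {"threads": 1}
args = {"threads": 16, "prefix": "Demo"}

# Equivalent of {**config_dict, **args}: later keys override earlier ones.
merged = dict(config_dict)
merged.update(args)

print("Python %d.%d" % tuple(sys.version_info[:2]))
print("dict unpacking supported:", supports_dict_unpacking())
print(merged)
```

Running `python --version` inside the activated environment gives the same information; since the README states that MarkerMAG is implemented in python3, an environment that resolves to Python 2.7 is the first thing to check.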

samtools sort error: No such file or directory

Hello,

Thanks for your tool! I installed version 1.1.5 today and ran into a problem when running the link module:

>MarkerMAG link -p tmp_SRR13125477 -marker ../03_metagenome_reanalysis/assembly_SRR13125477.bacteria.fasta -mag selected_mags/SRR13125477/ -x fa -r1 SRR13125477_R1.fasta -r2 SRR13125477_R2.fasta -t 12 -o tmp_SRR13125477 -no_polish

[2022-02-02 06:30:25] parameters for linking
 + mismatch:    2%
 + min_M_len:   45bp
 + min_M_pct:   35%
 + min_link_num_gnm:    9
 + min_link_num_ctg:    3
 + rd2_end_seq_len:     1000bp
 + max_short_cigar_pct: 75,85
[2022-02-02 06:30:25] parameters for estimating copy number
 + MAG_cov_subsample_pct:       25%
 + min_insert_size_16s: -1000bp
 + ignore_ends_len_16s: 150bp
 + ignore_lowest_pct:   25%
 + ignore_highest_pct:  25%
 + both_pair_mapped:    False
[2022-02-02 06:30:29] Rd1: quality control provided 16S rRNA gene sequences to:  
[2022-02-02 06:30:29] Rd1: remove sequences shorter than 1200 bp
[2022-02-02 06:30:29] Rd1: cluster at 99% identity and keep only the longest one in each cluster
[2022-02-02 06:30:29] Rd1: qualified 16S rRNA gene sequences exported to:
[2022-02-02 06:30:29] assembly_SRR13125477.bacteria_unpolished_min1200bp_c99.fasta
[2022-02-02 06:30:30] Rd1: mapping input reads to marker genes
[2022-02-02 06:30:30] Rd1: sorting mappping results
Traceback (most recent call last):
  File "/data/tagirdzh/miniconda3/bin/MarkerMAG", line 133, in <module>
    link_16s.link_16s(args, config_dict)
  File "/data/tagirdzh/miniconda3/lib/python3.8/site-packages/MarkerMAG/link_16s.py", line 2765, in link_16s
    os.remove(input_reads_to_16s_sam_bowtie)
FileNotFoundError: [Errno 2] No such file or directory: 'tmp_SRR13125477/tmp_SRR13125477_rd1_wd/tmp_SRR13125477_input_reads_to_16S.sam'
[E::hts_open_format] Failed to open file tmp_SRR13125477/tmp_SRR13125477_rd1_wd/tmp_SRR13125477_input_reads_to_16S.sam
samtools sort: can't open "tmp_SRR13125477/tmp_SRR13125477_rd1_wd/tmp_SRR13125477_input_reads_to_16S.sam": No such file or directory

The .sam file mentioned in the error does exist, however. I also noticed that bowtie2-align kept running in the background for some time after the program had exited with the error.

I'm using samtools v1.7 and bowtie2-align-s version 2.3.5.1

Am I doing something wrong?

Thanks,
Gulnara

barrnap_16S: 'No query genome detected'

Hello folks,

I'm pretty excited about MarkerMAG. Sadly, I can't seem to figure out the supplementary barrnap_16s module.

Standard barrnap works well on my setup. However, when I try to run barrnap_16s on MEGAHIT results as follows:

MarkerMAG barrnap_16s -p ALW -g /scratch/username/megahit/sample/ALW/final.contigs.fa -x fa -t 6 -force

Then I get the following error:
No query genome detected, program exited!

I have checked with `more` that the file final.contigs.fa is in that location and contains sequences, and it does indeed seem to be there. I know I can run barrnap directly for now and just delete what I don't need, but it would be nice to know if there's anything I can do to use the simpler version.

Add the possibility to use `.gz` files

Hi,

I really like the tool interface!

Read files are very often gzipped (because metagenomic samples are large), so it would be great if you could add the option to provide .gz files directly as input for -r1 and -r2.