bhattlab / mgefinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.

License: MIT License

Shell 0.48% Python 99.45% Dockerfile 0.07%

mgefinder's Issues

Unable to run test data

Here is the command I run for the test:
"mgefinder workflow denovo ../test_workdir"

Missing input file error when file is in directory

Hello, I keep getting "MissingInputException" and "Missing input files for rule pair" when I run "mgefinder workflow denovo", even though all files are in the correct directories. I have deleted and re-created the files with no luck. I have run MGEfinder previously without this error.

============================================================
===== Summary of your script job =====

The script file is: mgefinderworkflow.sh
The time limit is 150:00:00 HH:MM:SS.
The target directory is: /scratch/aubksw/MGEfinder/MGEfinder/cluster4
The working directory is: /scratch-local/aubksw.mgefinderworkflows.601211
The memory limit is: 8gb
The job will start running after: 2021-06-12T13:43:22
Job Name: mgefinderworkflows
Virtual queue: medium
QOS: --qos=medium
Constraints: --constraint=dmc
Using 6 cores on master node dmc19
Node list: dmc19
Nodes: dmc19 dmc19 dmc19 dmc19 dmc19 dmc19
Command typed:
/apps/scripts/run_script mgefinderworkflow.sh
Queue submit command:
sbatch --qos=medium -J mgefinderworkflows --begin=2021-06-12T13:43:22 --requeue --mail-user=[email protected] -o mgefinderworkflows.o601211 -t 150:00:00$

CHECKING DEPENDENCIES

PARAMETERS

command: workflow
workdir: /scratch/aubksw/MGEfinder/MGEfinder/cluster4
cores: 1
memory: 16000
unlock: False
rerun_incomplete: False
keep_going: False
sensitive: False
####################
MissingInputException in line 93 of /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Missing input files for rule pair:
/scratch/aubksw/MGEfinder/MGEfinder/cluster4/00.bam/Xsp60.XretroflexusSp953.bam
COMMAND: snakemake -s /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile --config wd=/scratch$
Traceback (most recent call last):
  File "/home/aubksw/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 78, in denovo
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 25, in _workflow
    shell(cmd)
  File "/home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original$
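
When Snakemake reports a missing input BAM even though the file seems to be present, one thing worth ruling out is a mismatch between the BAM file name and the sample and genome names. The sketch below is only a rough check, not part of MGEfinder. It assumes the layout 00.assembly/<sample>.fna, 00.genome/<genome>.fna and 00.bam/<sample>.<genome>.bam, which is inferred from a directory listing elsewhere on this page rather than from the documentation. The function name inventory is made up for illustration.

```python
from pathlib import Path


def inventory(workdir):
    """Compare BAM files in a work directory against sample/genome names.

    Assumed layout (inferred from a listing on this page, not from the
    MGEfinder documentation):
        00.assembly/<sample>.fna
        00.genome/<genome>.fna
        00.bam/<sample>.<genome>.bam
    """
    root = Path(workdir)
    samples = sorted(p.stem for p in (root / "00.assembly").glob("*.fna"))
    genomes = sorted(p.stem for p in (root / "00.genome").glob("*.fna"))
    bams = {p.name for p in (root / "00.bam").glob("*.bam")}

    # Every sample/genome combination implies one BAM name under the
    # assumed convention.
    expected = {f"{s}.{g}.bam" for s in samples for g in genomes}
    return {
        "samples": samples,
        "genomes": genomes,
        "missing_bams": sorted(expected - bams),    # pair exists, BAM does not
        "unmatched_bams": sorted(bams - expected),  # BAM matches no pair
    }
```

Calling inventory("/path/to/workdir") and printing the result shows at a glance whether a BAM name differs from the assembly or genome file name by a typo, a different extension, or letter case. An entry under missing_bams is only a problem if that sample was meant to be aligned to that genome.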

Errors at combining the inferseq files.

Hi!

I get the following errors when I run mgefinder workflow on my data sets. How can I resolve this issue?

Finished job 16.
7 of 16 steps (44%) done

rule make_database:
    input: ./01.mgefinder/NC_000962_3/NC_000962_3.all_inferseq.txt
    output: ./02.database/NC_000962_3/NC_000962_3.database.fna, ./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2
    jobid: 13
    benchmark: ./02.database/NC_000962_3/NC_000962_3.database.benchmark.txt
    wildcards: genome=NC_000962_3

#### CHECKING DEPENDENCIES ####
Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
###############################
#### PARAMETERS ####
command: makedatabase
inferseqfiles: ('./01.mgefinder/NC_000962_3/NC_000962_3.all_inferseq.txt',)
minimum_size: 30
maximum_size: 200000
threads: 1
memory: 16000
force: True
output_dir: ./02.database/NC_000962_3
prefix: NC_000962_3.database
####################
Parsing inferseq files
Combining the inferseq files...
Loading file 1/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_assembly.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Loading file 2/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_reference.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Loading file 3/3: ./01.mgefinder/NC_000962_3/JPN-B2019-Rv-1224_S1_L001/03.inferseq_overlap.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
Deleting old database directory...
No termini found in the input file...
Waiting at most 5 seconds for missing files.
Error in job make_database while creating output files ./02.database/NC_000962_3/NC_000962_3.database.fna, ./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2.
MissingOutputException in line 192 of /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile:
Missing files after 5 seconds:
./02.database/NC_000962_3/NC_000962_3.database.fna
./02.database/NC_000962_3/NC_000962_3.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 51, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow
    shell(cmd)
  File "/home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=. memory=16000 --cores 1 --configfile /home/user/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml ' returned non-zero exit status 1.

.
├── 00.assembly
│   ├── JPN-B2019-Rv-1224_S1_L001.fna
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.1.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.2.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.3.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.4.bt2
│   ├── JPN-B2019-Rv-1224_S1_L001.fna.rev.1.bt2
│   └── JPN-B2019-Rv-1224_S1_L001.fna.rev.2.bt2
├── 00.bam
│   ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.bam
│   └── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.bam.bai
├── 00.genome
│   ├── NC_000962_3.fna
│   ├── NC_000962_3.fna.1.bt2
│   ├── NC_000962_3.fna.2.bt2
│   ├── NC_000962_3.fna.3.bt2
│   ├── NC_000962_3.fna.4.bt2
│   ├── NC_000962_3.fna.amb
│   ├── NC_000962_3.fna.ann
│   ├── NC_000962_3.fna.bwt
│   ├── NC_000962_3.fna.pac
│   ├── NC_000962_3.fna.rev.1.bt2
│   ├── NC_000962_3.fna.rev.2.bt2
│   ├── NC_000962_3.fna.sa
│   └── log
│       ├── NC_000962_3.index_bowtie2.benchmark.txt
│       ├── NC_000962_3.index_bowtie2.log
│       └── NC_000962_3.index_bowtie2.log.err
├── 01.mgefinder
│   └── NC_000962_3
│       ├── JPN-B2019-Rv-1224_S1_L001
│       │   ├── 01.find.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 02.pair.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_assembly.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_overlap.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   ├── 03.inferseq_reference.JPN-B2019-Rv-1224_S1_L001.NC_000962_3.tsv
│       │   └── log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.find.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_assembly.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_overlap.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_overlap.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.log
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.inferseq_reference.log.err
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.benchmark.txt
│       │       ├── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.log
│       │       └── JPN-B2019-Rv-1224_S1_L001.NC_000962_3.pair.log.err
│       └── NC_000962_3.all_inferseq.txt
└── 02.database
    └── NC_000962_3
        └── NC_000962_3.database.benchmark.txt

I confirmed that both generating the test dataset and analyzing it with mgefinder workflow work fine.

Thank you in advance.

Yosm
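
The log above says "No termini found in the input file..." right before the database files go missing, so a quick first check is whether the three inferseq files contain any rows at all. This is a minimal sketch, assuming each file is tab-separated with a single header line. That is an assumption about the file format, not something stated in this thread, and count_rows is a made-up helper name.

```python
import csv


def count_rows(tsv_path):
    """Return the number of data rows in a tab-separated file.

    Assumes the first line is a header, which is an assumption about the
    inferseq outputs rather than something confirmed here.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the assumed header line
        return sum(1 for row in reader if row)
```

Running count_rows on each 03.inferseq_*.tsv file listed in the log would show whether every one of them is empty. If they all are, the missing database output is a consequence of having nothing to build from, not of filesystem latency.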

filtering used to create *.all_seqs.fna file?

In our data set there seems to be some filtering that occurred between the 01.clusterseq..tsv and the *.all_seqs.fna files. Even after the 01.clusterseq..tsv file was filtered to remove duplicates based on the sequence inference method, it contains far more MGE sequences than the *.all_seqs.fna file. Could you direct me where to look to find the filtering information that was used to create the *.all_seqs.fna file? My main concern is that the 01.clusterseq..tsv file contains a high number of potentially false positive MGE sequences.
Thanks in advance for your help!

Error in job genotype while creating output

Hello! I have an error while running workflow denovo. I had the same problem with my samples and with your test directory.

Finished job 3.
75 of 79 steps (95%) done

rule genotype:
    input: test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv, test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt
    output: test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv
    log: test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log
    jobid: 4
    benchmark: test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.benchmark.txt
    wildcards: genome=efae_GCF_900639545

zsh:2: = not found
Error in job genotype while creating output file test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv.
RuleException:
CalledProcessError in line 286 of /home/mk/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Command '
        if [ "True" == "True" ]; then
            mgefinder genotype --filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err ||             (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
        else
            mgefinder genotype --no-filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err ||             (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
        fi
        ' returned non-zero exit status 1.
  File "/home/mk/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile", line 286, in __rule_genotype
  File "/home/mk/miniconda3/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

test_workdir/03.results/efae_GCF_900639545/log directory is empty.
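
The line "zsh:2: = not found" suggests that the rule's shell snippet was executed by zsh. In zsh, an unquoted word that starts with "=" is expanded as a command lookup, so the "==" inside [ "True" == "True" ] fails there, while bash accepts it. How this Snakemake version picks its shell is not shown in the log, so the sketch below only helps to see which shells are in play. Both function names are made up for illustration.

```python
import os
import shutil
import subprocess


def shell_overview():
    """Report which shells are visible from this environment."""
    return {
        "SHELL": os.environ.get("SHELL"),
        "/bin/sh": os.path.realpath("/bin/sh"),  # follows symlinks, if any
        "bash": shutil.which("bash"),
        "zsh": shutil.which("zsh"),
    }


def double_equals_works(shell_path):
    """Return True if `[ "True" == "True" ]` exits with status 0 in that shell."""
    result = subprocess.run(
        [shell_path, "-c", '[ "True" == "True" ]'],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

If double_equals_works returns False for the shell that /bin/sh or SHELL points to, arranging for the workflow to run under bash is a plausible thing to try.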

Could I use it on metagenomics data?

Hi,
I want to detect MGEs (mobile genetic elements) in contigs, but I don't know whether MGEfinder can do this.

Categories of confidence level for the identified insertion sequence

Hi, durrantmm.

Could you tell us what IAwoC and ArSC mean?

I found four labels (IAwFC, IAwoC, IDB, ArSC) used to specify the confidence level of the identified insertion sequence in the file 02.genotype..tsv.
The user manual explains IAwFC and IDB, but I could not find the other two.

My goal is strain genotyping based on polymorphism in the insertion positions of MGEs, using resequencing data.

Additionally, if possible, could you recommend any tools for strain genotyping based on 02.genotype..tsv? I especially want to know which strains belong to which cluster, where a cluster consists of strains harboring an identical MGE profile.

Many thanks for your kind support.
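
Grouping strains that share an identical MGE profile does not need a dedicated tool. It can be sketched in a few lines once each insertion is reduced to a (sample, site, element) triple. Those three fields are hypothetical stand-ins. Which columns of the genotype table correspond to them is not confirmed here, so treat this as an illustration of the grouping idea only.

```python
from collections import defaultdict


def group_by_profile(rows):
    """Group samples that carry exactly the same set of insertions.

    `rows` is an iterable of (sample, site, element) tuples. These fields
    are hypothetical; map them to whatever columns your table provides.
    """
    profiles = defaultdict(set)
    for sample, site, element in rows:
        profiles[sample].add((site, element))

    groups = defaultdict(list)
    for sample, profile in profiles.items():
        # frozenset makes the whole profile usable as a dictionary key
        groups[frozenset(profile)].append(sample)

    # Largest groups first; ties broken by sample name for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


rows = [
    ("A", "site1", "IS1"), ("A", "site2", "IS2"),
    ("B", "site1", "IS1"), ("B", "site2", "IS2"),
    ("C", "site1", "IS1"),
]
print(group_by_profile(rows))  # [['A', 'B'], ['C']]
```

Note that a sample with no insertions never appears in rows, so it would have to be added to its own group separately.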

cutadapt: error: FASTQ file ended prematurely Cutadapt terminated with exit signal: '256'.

Hello, a very meaningful tool. I have two questions to ask:

1. Can I use fastp instead of SuperDeduper and Trim Galore to process raw sequencing data and obtain clean data for subsequent analysis? The company I sent my samples to for sequencing uses fastp and processes the raw data with the following steps:
(1) Discard a read pair if either read contains adapter contamination;
(2) Discard a read pair if more than 10% of bases are uncertain in either read;
(3) Discard a read pair if the proportion of low-quality (Phred quality <5) bases is over 50% in either read.

2. I encountered the following problem when using Trim Galore to process data, and I don't know how to handle it:
(trim-galore) [KXY@zju 673]$ trim_galore --fastqc --paired 673.nodup_R1.fastq.gz 673.nodup_R2.fastq.gz --cores 8
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 1.18
Could not detect version of Python used by Cutadapt from the first line of Cutadapt (but found this: >>>#!/bin/sh<<<)
Letting the (modified) Cutadapt deal with the Python version instead
pigz 2.6
Parallel gzip (pigz) detected. Proceeding with multicore (de)compression using 8 cores

Proceeding with 'pigz -p 4' for decompression
To decrease CPU usage of decompression, please install 'igzip' and run again

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

AUTO-DETECTING ADAPTER TYPE

Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> 673.nodup_R1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 905 AGATCGGAAGAGC 1000000 0.09
Nextera 1 CTGTCTCTTATA 1000000 0.00
smallRNA 0 TGGAATTCTCGG 1000000 0.00
Using Illumina adapter for trimming (count: 905). Second best hit was Nextera (count: 1)

Writing report to '673.nodup_R1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS

Input filename: 673.nodup_R1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.10
Cutadapt version: 1.18
Python version: could not detect
Number of cores used for trimming: 8
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Running FastQC on the data once trimming has completed
Output file(s) will be GZIP compressed

Cutadapt seems to be reasonably up-to-date. Setting -j 8
Writing final adapter and quality trimmed output to 673.nodup_R1_trimmed.fq.gz

Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file 673.nodup_R1.fastq.gz <<<
This is cutadapt 1.18 with Python 3.7.12
Command line parameters: -j 8 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC 673.nodup_R1.fastq.gz
Processing reads on 8 cores in single-end mode ...
ERROR: Traceback (most recent call last):
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/cutadapt/pipeline.py", line 412, in reader_process
    pipe.send_bytes(chunk)
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 88, in __exit__
    self.close()
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 215, in close
    self._raise_if_error()
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 231, in _raise_if_error
    raise IOError(message)
OSError

ERROR: Traceback (most recent call last):
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/cutadapt/pipeline.py", line 412, in reader_process
    pipe.send_bytes(chunk)
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 88, in __exit__
    self.close()
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 215, in close
    self._raise_if_error()
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/xopen/__init__.py", line 231, in _raise_if_error
    raise IOError(message)
OSError

ERROR: Traceback (most recent call last):
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/cutadapt/pipeline.py", line 486, in run
    (n, bp1, bp2) = self._pipeline.process_reads()
  File "/data/users/KXY/miniconda3/envs/trim-galore/lib/python3.7/site-packages/cutadapt/pipeline.py", line 230, in process_reads
    for read in self._reader:
  File "src/cutadapt/_seqio.pyx", line 176, in __iter__
cutadapt.seqio.FormatError: FASTQ file ended prematurely

cutadapt: error: FASTQ file ended prematurely

Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
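
"FASTQ file ended prematurely" usually points at the input file itself, for example a truncated or corrupted .fastq.gz, and not at Trim Galore. A minimal way to test that independently is to read the file to the end and count lines. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines), and check_fastq_gz is a made-up helper name.

```python
import gzip
import zlib


def check_fastq_gz(path):
    """Return (ok, message) after reading a gzipped FASTQ to the end."""
    lines = 0
    try:
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
    except (EOFError, OSError, zlib.error) as exc:
        # A truncated or corrupted gzip stream ends up here.
        return False, f"could not read past line {lines}: {exc}"
    if lines % 4 != 0:
        return False, f"{lines} lines is not a multiple of 4"
    return True, f"{lines // 4} records"
```

Running this on both 673.nodup_R1.fastq.gz and 673.nodup_R2.fastq.gz would show whether the deduplication step left an incomplete file. If it did, regenerating that file is the likely fix.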

Error in job genotype while creating output file

Hi all

I am trying to run the MGEfinder test files, but I get this error:

CHECKING DEPENDENCIES

Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
:
:
Error in job genotype while creating output file
test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv.
RuleException:
CalledProcessError in line 286 of /Users/mo/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Command '
if [ "True" == "True" ]; then
mgefinder genotype --filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err || (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
else
mgefinder genotype --no-filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err || (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
fi
' returned non-zero exit status 1.
File "/Users/mo/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile", line 286, in __rule_genotype
File "/Users/mo/miniconda3/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
:

:
subprocess.CalledProcessError: Command 'snakemake -s /Users/mo/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile --config wd=test_workdir/ memory=16000 --cores 1 --configfile /Users/mo/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.config.yml ' returned non-zero exit status 1.

Installation without conda or singularity

Hi,

On many HPC clusters, conda is not supported for several reasons, and singularity is kind of the last resort if nothing else works.

Would you mind at least providing a list of the required dependencies for MGEfinder on the wiki of this GitHub repository?

Then people could choose whether to use conda, singularity, or to provide the dependencies in a different way. I am not asking you to actively support this way of installing the software, but please at least provide the list of dependencies.

Best regards

Sam
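
Until an official list exists, the startup check that mgefinder prints (quoted in several reports on this page) names five tools and the versions it expects. The sketch below just looks for those executables on PATH. The list is copied from that output and may well be incomplete, and the executable names are assumed to match the tool names as printed.

```python
import shutil

# Tool names and versions as they appear in the "CHECKING DEPENDENCIES"
# output quoted on this page; this may not be the complete list.
EXPECTED = {
    "snakemake": "3.13.3",
    "einverted": "EMBOSS:6.6.0.0",
    "bowtie2": "2.3.5",
    "samtools": "1.9",
    "cd-hit": "4.8.1",
}


def find_tools(expected=EXPECTED):
    """Map each tool name to its location on PATH, or None if absent."""
    return {name: shutil.which(name) for name in expected}


for tool, path in find_tools().items():
    print(f"{tool} (expected {EXPECTED[tool]}): {path or 'not found'}")
```

This only checks presence, not versions. The Python packages that mgefinder itself imports are not covered.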

Problem with click and locale

I'm having a locale-related problem with Click (I think). I'm not sure how to remedy this; setting LANG to en_US.UTF-8 didn't seem to work.

I installed mgefinder using the conda instructions (install.sh).

I'm guessing there's something weird about my environment, but I'm not sure where to start.

$ mgefinder --help
#### CHECKING DEPENDENCIES ####
Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
###############################
Traceback (most recent call last):
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/aprasad/mydata/PD-3134/miniconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/aprasad/mydata/PD-3134/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/aprasad/mydata/PD-3134/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 696, in main
    _verify_python3_env()
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/aprasad/mydata/PD-3134/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/_unicodefun.py", line 124, in _verify_python3_env
    ' mitigation steps.' + extra
RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. Consult https://click.palletsprojects.com/en/7.x/python3/ for mitigation steps.

This system lists a couple of UTF-8 supporting locales that
you can pick from.  The following suitable locales were
discovered: aa_DJ.utf8, aa_ER.utf8, aa_ET.utf8, af_ZA.utf8, am_ET.utf8, an_ES.utf8, ar_AE.utf8, ar_BH.utf8, ar_DZ.utf8, ar_EG.utf8, ar_IN.utf8, ar_IQ.utf8, ar_JO.utf8, ar_KW.utf8, ar_LB.utf8, ar_LY.utf8, ar_MA.utf8, ar_OM.utf8, ar_QA.utf8, ar_SA.utf8, ar_SD.utf8, ar_SY.utf8, ar_TN.utf8, ar_YE.utf8, as_IN.utf8, ast_ES.utf8, ayc_PE.utf8, az_AZ.utf8, be_BY.utf8, bem_ZM.utf8, ber_DZ.utf8, ber_MA.utf8, bg_BG.utf8, bho_IN.utf8, bn_BD.utf8, bn_IN.utf8, bo_CN.utf8, bo_IN.utf8, br_FR.utf8, brx_IN.utf8, bs_BA.utf8, byn_ER.utf8, ca_AD.utf8, ca_ES.utf8, ca_FR.utf8, ca_IT.utf8, crh_UA.utf8, cs_CZ.utf8, csb_PL.utf8, cv_RU.utf8, cy_GB.utf8, da_DK.utf8, de_AT.utf8, de_BE.utf8, de_CH.utf8, de_DE.utf8, de_LU.utf8, doi_IN.utf8, dv_MV.utf8, dz_BT.utf8, el_CY.utf8, el_GR.utf8, en_AG.utf8, en_AU.utf8, en_BW.utf8, en_CA.utf8, en_DK.utf8, en_GB.utf8, en_HK.utf8, en_IE.utf8, en_IN.utf8, en_NG.utf8, en_NZ.utf8, en_PH.utf8, en_SG.utf8, en_US.utf8, en_ZA.utf8, en_ZM.utf8, en_ZW.utf8, es_AR.utf8, es_BO.utf8, es_CL.utf8, es_CO.utf8, es_CR.utf8, es_CU.utf8, es_DO.utf8, es_EC.utf8, es_ES.utf8, es_GT.utf8, es_HN.utf8, es_MX.utf8, es_NI.utf8, es_PA.utf8, es_PE.utf8, es_PR.utf8, es_PY.utf8, es_SV.utf8, es_US.utf8, es_UY.utf8, es_VE.utf8, et_EE.utf8, eu_ES.utf8, fa_IR.utf8, ff_SN.utf8, fi_FI.utf8, fil_PH.utf8, fo_FO.utf8, fr_BE.utf8, fr_CA.utf8, fr_CH.utf8, fr_FR.utf8, fr_LU.utf8, fur_IT.utf8, fy_DE.utf8, fy_NL.utf8, ga_IE.utf8, gd_GB.utf8, gez_ER.utf8, gez_ET.utf8, gl_ES.utf8, gu_IN.utf8, gv_GB.utf8, ha_NG.utf8, he_IL.utf8, hi_IN.utf8, hne_IN.utf8, hr_HR.utf8, hsb_DE.utf8, ht_HT.utf8, hu_HU.utf8, hy_AM.utf8, ia_FR.utf8, id_ID.utf8, ig_NG.utf8, ik_CA.utf8, is_IS.utf8, it_CH.utf8, it_IT.utf8, iu_CA.utf8, iw_IL.utf8, ja_JP.utf8, ka_GE.utf8, kk_KZ.utf8, kl_GL.utf8, km_KH.utf8, kn_IN.utf8, ko_KR.utf8, kok_IN.utf8, ks_IN.utf8, ku_TR.utf8, kw_GB.utf8, ky_KG.utf8, lb_LU.utf8, lg_UG.utf8, li_BE.utf8, li_NL.utf8, lij_IT.utf8, lo_LA.utf8, lt_LT.utf8, lv_LV.utf8, mag_IN.utf8, mai_IN.utf8, mg_MG.utf8, 
mhr_RU.utf8, mi_NZ.utf8, mk_MK.utf8, ml_IN.utf8, mn_MN.utf8, mni_IN.utf8, mr_IN.utf8, ms_MY.utf8, mt_MT.utf8, my_MM.utf8, nb_NO.utf8, nds_DE.utf8, nds_NL.utf8, ne_NP.utf8, nhn_MX.utf8, niu_NU.utf8, niu_NZ.utf8, nl_AW.utf8, nl_BE.utf8, nl_NL.utf8, nn_NO.utf8, nr_ZA.utf8, nso_ZA.utf8, oc_FR.utf8, om_ET.utf8, om_KE.utf8, or_IN.utf8, os_RU.utf8, pa_IN.utf8, pa_PK.utf8, pap_AN.utf8, pl_PL.utf8, ps_AF.utf8, pt_BR.utf8, pt_PT.utf8, ro_RO.utf8, ru_RU.utf8, ru_UA.utf8, rw_RW.utf8, sa_IN.utf8, sat_IN.utf8, sc_IT.utf8, sd_IN.utf8, se_NO.utf8, shs_CA.utf8, si_LK.utf8, sid_ET.utf8, sk_SK.utf8, sl_SI.utf8, so_DJ.utf8, so_ET.utf8, so_KE.utf8, so_SO.utf8, sq_AL.utf8, sq_MK.utf8, sr_ME.utf8, sr_RS.utf8, ss_ZA.utf8, st_ZA.utf8, sv_FI.utf8, sv_SE.utf8, sw_KE.utf8, sw_TZ.utf8, szl_PL.utf8, ta_IN.utf8, ta_LK.utf8, te_IN.utf8, tg_TJ.utf8, th_TH.utf8, ti_ER.utf8, ti_ET.utf8, tig_ER.utf8, tk_TM.utf8, tl_PH.utf8, tn_ZA.utf8, tr_CY.utf8, tr_TR.utf8, ts_ZA.utf8, tt_RU.utf8, ug_CN.utf8, uk_UA.utf8, unm_US.utf8, ur_IN.utf8, ur_PK.utf8, ve_ZA.utf8, vi_VN.utf8, wa_BE.utf8, wae_CH.utf8, wal_ET.utf8, wo_SN.utf8, xh_ZA.utf8, yi_US.utf8, yo_NG.utf8, yue_HK.utf8, zh_CN.utf8, zh_HK.utf8, zh_SG.utf8, zh_TW.utf8, zu_ZA.utf8

$ echo $LANG
en_US.UTF-8
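
LANG is only one of several locale variables, and LC_ALL, when set, takes precedence over it. So LANG can read en_US.UTF-8 while Python still ends up with ASCII. The sketch below prints the relevant variables and the encoding Python settles on. As far as I understand, Click 7 aborts when that preferred encoding resolves to ASCII, but treat that as an assumption about Click's internals. The function name locale_report is made up.

```python
import codecs
import locale
import os


def locale_report():
    """Collect the settings that decide Python's preferred text encoding."""
    preferred = locale.getpreferredencoding()
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LANG": os.environ.get("LANG"),
        "preferred_encoding": preferred,
        # codecs.lookup normalizes aliases such as ANSI_X3.4-1968 to "ascii"
        "is_ascii": codecs.lookup(preferred).name == "ascii",
    }


for key, value in locale_report().items():
    print(f"{key}: {value}")
```

If is_ascii is True while LANG looks right, checking whether LC_ALL or LC_CTYPE is set to C or POSIX, and exporting LC_ALL to one of the UTF-8 locales listed in the error message, would be the next thing to try.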

Hello there, I'm trying to run a test on non-toxin CD's MGE, but I encounter an error which I can't solve

It looks like this. Does this mean that the contents of my .fna file do not fit the workflow? I was able to finish running the 2.4G files used in the tutorial.
My workdir looks like this:
----BJ22012
----00.assembly
----00.bam
----00.genome
The bam and bam.bai files were created as the tutorial showed, but the assembly file and the genome file come from an assembly made with an unknown approach, and I am not sure whether they fit the workflow. If the workflow needs a specific format for the file contents, please give a short sample. Thanks to anyone who can help.

No output file, no error message

Hello, I have tried running mgefinder on multiple data sets. Some have worked and some have not. My latest run produced this:

CHECKING DEPENDENCIES

PARAMETERS

command: workflow
workdir: /scratch/aubksw/MGEfinder/MGEfinder/cluster5
cores: 1
memory: 16000
unlock: False
rerun_incomplete: False
keep_going: False
sensitive: True
####################
COMMAND: snakemake -s /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.sensitive.Snakefile --config wd=/scratch/aubksw/MGEfinder/MGEfinder/cluster5 memory=16000 --cores 1 --configfile /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.sensitive.config.yml
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1

rule all:
jobid: 0

Finished job 0.
1 of 1 steps (100%) done

There were no output files created and no error messages so I am not sure what went wrong. Please let me know if there is a solution. Thank you

unable to install using install.sh script - package dependencies not available?

I am trying to download and activate mgefinder on a Mac running macOS Ventura 13.2.1 with an Apple M2 Max chip.

I receive this error every time I try to run the install script. Even after adding conda-forge and other channels to my conda configuration, the same error appears. I am new to coding, so this may be an easy fix, but I have not been able to find a workaround. Thanks!

"(base) mafuller@JV25QX0JKV MGEfinder % bash install.sh
Removing mgefinder environment if already installed...
Installing mgefinder environment...
Channels:

bioconda
defaults
conda-forge
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

zstd==1.3.7=h5bba6e5_0
zlib==1.2.11=h1de35cc_3
yaml==0.1.7=hc338f04_2
xz==5.2.4=h1de35cc_4
wrapt==1.11.2=py36h1de35cc_0
wheel==0.33.6=py36_0
urllib3==1.25.7=py36_0
tk==8.6.8=ha441bb4_0
tbb==2019.8=h04f5b5a_0
sqlite==3.30.1=ha441bb4_0
snakemake==3.13.3=py36_0
six==1.13.0=py36_0
setuptools==42.0.2=py36_0
samtools==1.9=h8aa4d43_12
requests==2.22.0=py36_1
readline==7.0=h1de35cc_5
pyyaml==5.2=py36h1de35cc_0
python==3.6.9=h359304d_0
pysocks==1.7.1=py36_0
pysftp==0.2.9=py36_0
pysam==0.15.3=py36h726f235_1
pyopenssl==19.1.0=py36_0
pynacl==1.3.0=py36h1de35cc_0
pycparser==2.19=py36_0
psutil==5.6.7=py36h1de35cc_0
pip==19.3.1=py36_0
perl==5.26.2=h4e221da_0
paramiko==2.6.0=py36_0
pandas==0.25.3=py36h0a44026_0
openssl==1.1.1d=h1de35cc_3
numpy-base==1.17.4=py36h6575580_0
numpy==1.17.4=py36h890c691_0
ncurses==6.1=h0a44026_1
mkl_random==1.1.0=py36ha771720_0
mkl_fft==1.0.15=py36h5e564d8_0
mkl-service==2.3.0=py36hfbe908c_0
mkl==2019.4=233
libxml2==2.9.9=hf6e021a_1
libwebp==1.0.1=hd73b212_0
libtiff==4.1.0=hcb84e12_0
libssh2==1.8.2=ha12b0ac_0
libsodium==1.0.16=h3efe00b_0
libpng==1.6.37=ha441bb4_0
libiconv==1.15=hdd342a3_7
libgfortran==3.0.1=h93005f0_2
libgd==2.2.5=h527e5b3_3
libffi==3.2.1=h475c297_4
libedit==3.1.20181209=hb402a30_0
libdeflate==1.0=h1de35cc_1
libcxxabi==4.0.1=hcfea43d_1
libcxx==4.0.1=hcfea43d_1
libcurl==7.67.0=h051b688_0
krb5==1.16.4=hddcf347_0
jpeg==9b=he5867d9_2
intel-openmp==2019.4=233
idna==2.8=py36_0
icu==58.2=h4b95b61_1
htslib==1.9=h3a161e8_7
giflib==5.1.4=h1de35cc_1
ftputil==3.2=py36_0
freetype==2.9.1=hb4e5f40_0
fontconfig==2.13.0=h5d5b041_1
filechunkio==1.6=py36_0
expat==2.2.6=h0a44026_0
emboss==6.6.0=h6debe1e_0
dropbox==5.2.1=py36_0
docutils==0.15.2=py36_0
curl==7.67.0=ha441bb4_0
cryptography==2.8=py36ha12b0ac_0
chardet==3.0.4=py36_1003
cffi==1.13.2=py36hb5b8e2f_0
certifi==2019.11.28=py36_0
cd-hit==4.8.1=hd9629dc_0
ca-certificates==2019.11.27=0
bzip2==1.0.8=h1de35cc_0
bwa==0.7.17=h2573ce8_7
bowtie2==2.3.5=py36h5c9b4e4_0
blas==1.0=mkl
bcrypt==3.1.7=py36h1de35cc_0
asn1crypto==1.2.0=py36_0

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Installation Complete.

Before running mgefinder, activate the mgefinder environment with

conda activate mgefinder

You can then run mgefinder by typing

mgefinder [command]
Send any questions to [email protected]"

mgefinder issue

mgefinder is giving me an error I don't understand. As far as I can tell, the input files follow the specifications and everything should be up to date (installed or updated within the last week). The entire screen dump is in the attached file.
mge_screen-out.txt

Pertinent part would seem to be:
click.echo('Loading file {num1}/{num2}: {f}'.format(num1=1, num2=len(inferseq_files), f=inferseq_files[0]))
IndexError: list index out of range
Error in job clusterseq while creating output file /home/rick/Documents/Campy-mge-test/03.results/AL111168/01.clusterseq.AL111168.tsv.
RuleException:
CalledProcessError in line 259 of /home/rick/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Command '
mgefinder clusterseq -minsize 70 -maxsize 200000 --threads 1 --memory 50000 /home/rick/Documents/Campy-mge-test/01.mgefinder/AL111168/AL111168.all_inferseq_database.txt -o /home/rick/Documents/Campy-mge-test/03.results/AL111168/01.clusterseq.AL111168.tsv
' returned non-zero exit status 1.
File "/home/rick/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile", line 259, in __rule_clusterseq
File "/home/rick/anaconda3/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Thank you for any help,
Rick

the sum of unique clusters in "01.clusterseq" file does not match the number of unique clusters in "03.sum_cluster" file

Hi,
Thank you for making this great tool.
I finally got the 03.results folder. But when I checked the number of unique clusters in "01.clusterseq.GCA_000210735.tsv", I found that it does not match the number of clusters in "03.summarize.GCA_000210735.clusters.tsv" (for example, 1331 vs 1234). The same is true for the number of groups. In addition, the number of unique inferred_seq values in "01.clusterseq.GCA_000210735.tsv" differs from the number of contigs in "04.makefasta.GCA_000210735.all_seqs.fna". Do you have any explanation for this? Thanks a lot!

Using MGEfinder with multiple species

This is a question more than an issue. I am interested in looking for MGEs shared across multiple species of staph. Should I map all the reads with BWA to a single reference species (the one that shares the most genes with all the others), or map each species to its own reference? In other words, are MGEs identified in different species using species-specific reference genomes going to be comparable?

Thanks in advance for your input!

KeyError

Hello, do you know what is causing this error and how to fix it?

KeyError: 'emb|FRDD01000003.1|'
Error in job pair while creating output file /scratch/aubksw/MGEfinder/MGEfinder/cluster4/01.mgefinder/XretroflexusSp953/Xsp60/02.pair.Xsp60.XretroflexusSp953.tsv.
RuleException:
CalledProcessError in line 109 of /home/aubksw/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Command '
mgefinder pair -maxdr 20 -minq 20 -minial 21 -maxjsp 0.15 -lins 30 /scratch/aubksw/MGEfinder/MGEfinder/cluster4/01.mgefinder/XretroflexusSp953/Xsp60/01.find.Xsp60.XretroflexusSp953.tsv $
' returned non-zero exit status 1.

Question about python version required

Hello!
I installed MGEfinder using Method 1 as per the installation instructions and everything went fine.
I ran (after activating the conda env):

$ mgefinder find <bam alignment>.bam

and was able to obtain the *.tsv with correct looking output.
Following this I ran:

$ mgefinder pair <tsv from mgefinder find>.tsv <bam alignment>.bam <ref genome>.fasta

This command was unsuccessful and the output was:

#### CHECKING DEPENDENCIES ####
Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
###############################
#### PARAMETERS ####
command: pair
findfile: mgefinder.find.tsv
bamfile: EC_IC_20_1X_MinION_Hybrid.align.sorted.bam
genome: EC_IC_20_1X_MinION_Hybrid.assembly.fasta
max_direct_repeat_length: 20
min_alignment_quality: 20
min_alignment_inner_length: 21
max_junction_spanning_prop: 0.15
large_insertion_cutoff: 30
output_file: mgefinder.pairs.tsv
####################
Finding all flank pairs within 20 bases of each other ...
Finding all inverted repeats at termini in 8 candidate pairs...
Assigning pairs according to existence of inverted repeats, read count difference, and flank length difference...
Traceback (most recent call last):
  File "/home/reedrich/miniconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 123, in pair
    min_alignment_inner_length, max_junction_spanning_prop, large_insertion_cutoff, output_file)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 40, in _pair
    flank_pairs = flank_pairer.run_pair_flanks()
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 99, in run_pair_flanks
    assigned_pairs = self.assign_pairs(pairs)
  File "/home/reedrich/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 245, in assign_pairs
    self.get_header_list()].sort_values(['contig', 'pos_5p', 'pos_3p'])
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1288, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1953, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1594, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/home/reedrich/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1654, in _validate_read_indexer
    "Passing list-likes to .loc or [] with any missing labels "
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

I am wondering what the source of the error is. Is it due to an out-of-date dependency, or perhaps to using a different version of Python than the one required?

Thank you for your advice and time.
All the best,
-BioRRW
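One detail in the traceback above may be worth checking: the mgefinder files are under `miniconda3/envs/mgefinder`, but the pandas frames are under `~/.local/lib/python3.6/site-packages`. That pattern would be consistent with a user-site pandas shadowing the copy installed in the conda environment, although the traceback alone does not prove it. The following is a minimal sketch, not part of mgefinder, for checking where a module would be imported from; the helper names `module_origin` and `is_inside` are made up for illustration.

```python
import importlib.util
import sys
from pathlib import Path


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_inside(path, prefix):
    """True if `path` lives somewhere under the directory `prefix`."""
    try:
        Path(path).resolve().relative_to(Path(prefix).resolve())
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    origin = module_origin("pandas")
    if origin is None:
        print("pandas is not importable from this interpreter")
    else:
        # sys.prefix is the root of the active environment.
        where = "inside" if is_inside(origin, sys.prefix) else "OUTSIDE"
        print(f"pandas would load from {origin} ({where} {sys.prefix})")
```

Run it with the mgefinder environment activated. If the reported path is outside the environment, the environment's own pandas is not the one being used.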

Error in pair step

Hi! I'm running into an error on the pair step while using MGEfinder v1.0.6. I'm using the workflow denovo command, and the working directory only produces the 01.mgefinder directory. Within the ~/01.mgefinder///log/..pair.log.err file, I get this error:

Traceback (most recent call last):
File "/home/erin.newcomer/.conda/envs/mgefinder/bin/mgefinder", line 8, in <module>
sys.exit(cli())
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 194, in pair
min_alignment_inner_length, max_junction_spanning_prop, large_insertion_cutoff, output_file)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 40, in _pair
flank_pairs = flank_pairer.run_pair_flanks()
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 113, in run_pair_flanks
final_pairs = self.get_direct_repeats(filtered_pairs)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 278, in get_direct_repeats
positions = self.get_reference_direct_repeats(flank_pairs, genome_dict)
File "/home/erin.newcomer/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/pair.py", line 292, in get_reference_direct_repeats
direct_repeat = genome_dict[contig][(start+1):end]
KeyError: '1'

Do you have any advice?
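The failing line is `genome_dict[contig]`, so the contig name `'1'` was looked up in a dictionary that does not contain it. One possible reading is that the contig names in the BAM file differ from the sequence IDs in the reference FASTA. The sketch below assumes the dictionary keys are the FASTA IDs (the first token after `>`), which the traceback does not confirm; `fasta_ids` and `missing_contigs` are hypothetical helpers, not mgefinder functions.

```python
def fasta_ids(fasta_text):
    """Collect sequence IDs: the first whitespace-delimited token after '>'."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def missing_contigs(bam_contigs, fasta_text):
    """Return the contig names used in the BAM that the FASTA does not define."""
    known = set(fasta_ids(fasta_text))
    return sorted(c for c in set(bam_contigs) if c not in known)


if __name__ == "__main__":
    # Toy data: the BAM refers to contig "1", the FASTA calls it "chr1".
    reference = ">chr1 some description\nACGT\n>chr2\nGGCC\n"
    print(missing_contigs(["1", "chr2"], reference))
```

The contig names in a BAM header can be listed with `samtools view -H`. Any name reported as missing points to a reference that differs from the one used for alignment.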

How to choose a reference genome

Hello, thanks for the wonderful tool you have developed for exploring MGEs! As you mention in your manuscript, the choice of reference genome is important (isolates should share at least 98.5% nucleotide identity with the reference genome), so before starting my analyses I would like to ask how I should choose my reference genome. I have downloaded ~200 genomes of one bacterial species, and the phylogenetic tree shows that they fall into five clades. I want to analyze the mobile genetic elements of the five clades and compare them.

I intend to perform MGE analysis of the five clades separately using your MGEfinder, however, I am a little confused about which genome can be used as the reference genome for each clade, can you help me? Thanks in advance.
Best,
jk yin

No Output File

mgefinder workflow denovo --cores 4 ../fastqtrim/

CHECKING DEPENDENCIES

PARAMETERS

command: workflow
workdir: ../fastqtrim/
cores: 4
memory: 16000
unlock: False
rerun_incomplete: False
keep_going: False
sensitive: False
####################
COMMAND: snakemake -s /home/biobootcamp/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile --config wd=../fastqtrim/ memory=16000 --cores 4 --configfile /home/biobootcamp/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.config.yml
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1

rule all:
jobid: 0

Finished job 0.
1 of 1 steps (100%) done

Unable to run the workflow

Hello,
I tried installing mgefinder using both conda and pip but for either method, I encountered the following error:

mgefinder workflow denovo workdir/

Usage: mgefinder workflow [OPTIONS] WORKDIR
Try 'mgefinder workflow --help' for help.

Error: Invalid value for 'WORKDIR': Path 'denovo' does not exist.

And when I removed "denovo", I got another error:

mgefinder workflow workdir/

PARAMETERS

command: workflow
workdir: workdir/
snakefile: /home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow/Snakefile
configfile: /home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow/config.yml
cores: 1
memory: 16000
unlock: False
rerun_incomplete: False
keep_going: False
###################
Traceback (most recent call last):
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/bin/mgefinder", line 11, in <module>
    sys.exit(cli())
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/mgefinder/main.py", line 46, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/styphi/sf_D/Test_dir/miniconda2/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow.py", line 7, in _workflow
    force_incomplete=rerun_incomplete, keepgoing=keep_going)
TypeError: snakemake() got an unexpected keyword argument 'configfile'

Could you please advise?

Many thanks.
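The `TypeError` says that the installed `snakemake()` function does not accept a `configfile` keyword, which would be consistent with a snakemake version different from the one mgefinder was written against (other logs on this page show an expected version of 3.13.3). As a rough sketch of how such a mismatch can be detected before calling a function, the snippet below inspects a callable's signature. `old_api` and `new_api` are hypothetical stand-ins, not the real snakemake signatures.

```python
import inspect


def accepts_keyword(func, name):
    """True if `func` can be called with `name=...` as a keyword argument."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for two incompatible versions of the same entry point.
def old_api(snakefile, configfile=None):
    return snakefile, configfile


def new_api(snakefile, configfiles=None):
    return snakefile, configfiles


if __name__ == "__main__":
    for func in (old_api, new_api):
        print(func.__name__, accepts_keyword(func, "configfile"))
```

With the mgefinder environment active, the same check could be pointed at `snakemake.snakemake` to see which keyword the installed version expects.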

Couldn't finish the tutorial

Hi,

Our group (Dr. Pamer's UChicago Lab) is very excited to find a de novo approach to detect MGE. I am running your tutorial just to see how it works. However, I ran into some issues which I have no idea how to fix. It seems there is usage error about the "click" package. Please see the following error message.

I ran the following command as you listed in the tutorial.

$ mgefinder workflow --cores 16 test_workdir/

The following is the error message.

#### PARAMETERS ###
command: workflow
workdir: test_workdir/
snakefile: /home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow/Snakefile
configfile: /home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow/config.yml
cores: 16
memory: 16000
unlock: False
rerun_incomplete: False
keep_going: False
###################
Traceback (most recent call last):
  File "/home/dfi_user/miniconda3/envs/mgefinder/bin/mgefinder", line 11, in <module>
    sys.exit(cli())
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/main.py", line 46, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/workflow.py", line 7, in _workflow
    force_incomplete=rerun_incomplete, keepgoing=keep_going)
TypeError: snakemake() got an unexpected keyword argument 'configfile'

I am running the tutorial inside the mgefinder environment created by miniconda3. What other specs should I provide for you to identify the problem?

Also, what is the purpose of de-duplicating the raw fastq files? I couldn't get hts_SuperDeduper to work yet. Is it OK if I skip that step?

Thanks,

Eddi

Error in job make_database while creating output files

Hello,

I'm writing about some trouble that I have had running MGEfinder with my own data. I was able to complete the step-by-step tutorial without any issues-- The "mgefinder workflow denovo" command ran just fine and the correct output files were generated.

For my own data, I created a directory called "workdir", which included three directories:

00.assembly (includes illumina-based assembly file produced by SPAdes)
00.bam (generated using bwa mem and "mgefinder formatbam" to convert sam to bam)
00.genome (includes the reference genome generated using Flye)

To run mgefinder, I used the following command: mgefinder workflow denovo -t 10 workdir/

The program terminated with only 53% of the analysis completed. I've included the error file as an attachment. The issue seems to be related to the following portion of the error file:

rule make_database:
input: workdir/01.mgefinder/flye_final_polished/flye_final_polished.all_inferseq.txt
output: workdir/02.database/flye_final_polished/flye_final_polished.database.fna, workdir/02.database/flye_final_polished/flye_final_polished.database.fna.1.bt2
jobid: 12
benchmark: workdir/02.database/flye_final_polished/flye_final_polished.database.benchmark.txt
wildcards: genome=flye_final_polished
threads: 10

Waiting at most 5 seconds for missing files.
Error in job make_database while creating output files workdir/02.database/flye_final_polished/flye_final_polished.database.fna, workdir/02.database/flye_final_polished/flye_final_polished.database.fna.1.bt2.
MissingOutputException in line 192 of /scratch2/software/anaconda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Missing files after 5 seconds:
workdir/02.database/flye_final_polished/flye_final_polished.database.fna
workdir/02.database/flye_final_polished/flye_final_polished.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Any insight and assistance would be appreciated! Thank you!

mgefinder.sh.e180713.txt
mgefinder.sh.o180713.txt

MGE for FASTA format

I have a FASTA file assembled already, is it possible to use this file in MGEfinder to detect any MGE that is present in my sequence?

If so, what command should I use to run it? The tutorial mostly focuses on unassembled reads.

Thanks

How to find the Insertion-Enriched Sites caused by Insertion sequences

Hi Matthew,
I’m very interested in the Analysis of Insertion-Enriched Sites in your paper.
I am trying to analyze Insertion-Enriched Sites caused by insertion sequences (upstream of, within, and downstream of the nearest CDS) among 200 completely sequenced bacterial genomes.
I wonder if I can use the MGEfinder tool to do this?
If yes, could you please give me some ideas (or a workflow)?
Many thanks in advance.
Sincerely,
Dai Kuang

Calling mgefinder from a different workflow

I created a snakemake workflow to make the necessary files and organize them for mgefinder, but I am having difficulty merging mgefinder, which has its own environment and config file, into my workflow.

I tried using subworkflow, but then mgefinder runs first when I need it to run last.
I considered using include, but it prevents me from using a separate environment or config file.
I considered creating a third workflow to call both subworkflows, but I haven't tried it yet and it would complicate my debugging.
I considered modifying your original code to include 'env' in each call, but that is (a) time-consuming and (b) not ideal in terms of best practices.
Finally, I exported the mgefinder environment and made a rule that calls it. It works, but my workflow sees it as one job, and it is certainly far from best practices:

rule mgefinder:
        input:
            lambda wildcards: 
                expand("mgefinder/{group}/00.bam/{sample2}.{sample1}.bam.bai",
                    sample1=GROUPS.get(int(wildcards.group)), 
                    sample2=GROUPS.get(int(wildcards.group)), 
                    allow_missing=True),
            lambda wildcards: 
                expand("mgefinder/{group}/00.{dirname}/{sample}.fna",
                    sample=GROUPS.get(int(wildcards.group)), 
                    dirname=["assembly","genome"],allow_missing=True),
        output:
            "mgefinder/{group}/dummy.txt"
        params:
            prefix="mgefinder/{group}/"
        conda:
            "database/mgefinder.yaml"
        shell:
            "mgefinder workflow denovo {params.prefix}; touch {params.prefix}/dummy.txt"

I was wondering if you have any suggestions as to how to do it?

Issue in converting .sam to .bam

When I attempt to convert .sam to .bam I get the following error and no .bam file is created:

(mgefinder) -bash-4.2$ mgefinder formatbam 1027D_19.NC_014925.1.sam 1027D_19.NC_014925.1.bam
Traceback (most recent call last):
File "/network/rit/lab/andamlab/bin/miniconda3/envs/mgefinder/bin/mgefinder", line 7, in <module>
from mgefinder.main import cli
File "/network/rit/lab/andamlab/bin/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/main.py", line 6, in <module>
from mgefinder.pair import _pair
File "/network/rit/lab/andamlab/bin/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/pair.py", line 8, in <module>
from mgefinder import fastatools, embosstools, pysamtools, sctools
File "/network/rit/lab/andamlab/bin/miniconda3/envs/mgefinder/lib/python3.7/site-packages/mgefinder/fastatools.py", line 6, in <module>
from Bio.Alphabet import IUPAC
File "/network/rit/lab/andamlab/bin/miniconda3/envs/mgefinder/lib/python3.7/site-packages/Bio/Alphabet/__init__.py", line 21, in <module>
"Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
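The import that fails is `from Bio.Alphabet import IUPAC`. `Bio.Alphabet` was removed in Biopython 1.78, so any environment with that release or a later one will raise this error as soon as `mgefinder.fastatools` is imported. The snippet below is a small sketch for checking an installed version against that cutoff; `version_tuple` and `has_bio_alphabet` are illustrative helpers, and the final block only runs if Biopython is importable.

```python
import re

# Bio.Alphabet was removed in Biopython 1.78.
ALPHABET_REMOVED_IN = (1, 78)


def version_tuple(version):
    """Turn '1.79' or '1.79.dev0' into a tuple of leading integers, e.g. (1, 79)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_bio_alphabet(biopython_version):
    """True if this Biopython release still ships Bio.Alphabet."""
    return version_tuple(biopython_version) < ALPHABET_REMOVED_IN


if __name__ == "__main__":
    try:
        import Bio
    except ImportError:
        print("Biopython is not installed in this environment")
    else:
        status = "available" if has_bio_alphabet(Bio.__version__) else "removed"
        print(f"Biopython {Bio.__version__}: Bio.Alphabet is {status}")
```

If the check reports that the module has been removed, installing an older Biopython release into the mgefinder environment is one way to get the original import working again.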

Problem with clusterseq

Hi,

I am very excited about using mgefinder, but so far I cannot make it work. I successfully ran the workflow with the test dataset. However, multiple trials with different isolates and different reference genomes gave me the same error.

Error in job clusterseq while creating output file workdir/03.results/R27/01.clusterseq.R27.tsv.
RuleException:
CalledProcessError in line 259 of /home/rozwandm/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Command 'mgefinder clusterseq -minsize 70 -maxsize 200000 --threads 1 --memory 16000 workdir/01.mgefinder/R27/R27.all_inferseq_database.txt -o workdir/03.results/R27/01.clusterseq.R27.tsv' returned non-zero exit status 1.
File "/home/rozwandm/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile", line 259, in __rule_clusterseq
File "/home/rozwandm/.conda/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Thank you in advance for your help,
Marta

Confusion in Interpreting the Results

Greetings! Firstly, I would like to express my sincere gratitude for your dedicated efforts in developing the MGEfinder software. Your work has greatly facilitated our research endeavors.

I am currently engaged in the analysis of short-read sequencing data related to the evolutionary drug resistance in Klebsiella pneumoniae. The goal is to uncover whether there have been alterations in MGEs within the strains during their evolutionary process. Presently, I have completed the analysis for 18 strains.

However, I have encountered some queries while interpreting the analysis results. In the "04.makefasta.ref.all_seqs.fna" file generated by MGEfinder, I observed only 4 sequences, whereas in the "01.clusterseq.ref.tsv" file, I noticed the presence of 455 sequences labeled as "inferred_seq." After annotating these sequences using ISfinder, all were identified as Insertion Sequence (IS) elements.

I would like to take this opportunity to seek your guidance on whether the interpretation and handling of these results are correct. Particularly, given the occurrence of only 4 sequences in the "04.makefasta.ref.all_seqs.fna" file, is there a possibility of oversight or misunderstanding in the analysis process? Your professional guidance will play a crucial role in resolving this matter, and I sincerely appreciate your assistance once again.

Best regards

Kindly be informed that, for the ease of data upload, the file type has been modified to .txt.
04.makefasta.ref.all_seqs.txt
01.clusterseq.ref.txt

Question about "Error in job make_database"

Hi all,

I would like to use MGEfinder to detect insertion sequences using the whole workflow. I prepared the directories (00.genome, 00.bam, 00.assembly) and files (sample.refer.bam etc.) following the tutorial.
However, when I run mgefinder workflow /workdir,
I get the error below. It would be highly appreciated if anyone could help with it. Thank you very much in advance.

Best,
Jason

How to use MGEfinder software to find MGEs from an assembled genome instead of raw reads

Hi Sir,
I want to use the MGEfinder tool to find mobile genetic elements in an assembled genome rather than in raw reads. How can I do this?

Can MGEfinder be used for metagenomics data of infant gut?

Hi,
As the title says, can MGEfinder be used for metagenomic data from the infant gut microbiome? What problems would you expect if I use it? Thanks a lot!

error in step mgefinder formatbam

Hi all
I would appreciate it if anyone could help me.
I am trying to run this command:
(mgefinder) lololly-MBP MGEfinder % mgefinder formatbam s5.sam s5.bam
but it gives me this error:

CHECKING DEPENDENCIES

Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Traceback (most recent call last):
File "/Users/lololly/miniconda3/envs/mgefinder/bin/mgefinder", line 5, in <module>
from mgefinder.main import cli
File "/Users/lololly/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 38, in <module>
check_dependencies()
File "/Users/lololly/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/dependencies.py", line 39, in check_dependencies
bowtie2_checker.check(extract_version=lambda x: x.split()[2])
File "/Users/lololly/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/dependencies.py", line 12, in check
output = shell(cmd.format(tool=self.tool), read=True).decode('utf-8').strip()
File "/Users/lololly/miniconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'bowtie2 --version 2>&1' returned non-zero exit status 255.
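The traceback shows that the failure comes from the dependency check itself: `bowtie2 --version 2>&1` exits with status 255 before `formatbam` does any work. Running the version command on its own usually exposes the underlying message. The sketch below does that with the standard library only; `probe` is a hypothetical helper, and the `samtools --version` entry is simply a second tool to compare against.

```python
import shutil
import subprocess
import sys


def probe(command):
    """Run a version command and return (exit_status, combined_output).

    The status is None when the executable is not found on PATH.
    """
    if shutil.which(command[0]) is None:
        return None, ""
    done = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as the shell's 2>&1
        universal_newlines=True,
    )
    return done.returncode, done.stdout.strip()


if __name__ == "__main__":
    for command in (["bowtie2", "--version"], ["samtools", "--version"]):
        status, output = probe(command)
        if status is None:
            print(f"{command[0]}: not found on PATH")
        else:
            first_line = output.splitlines()[0] if output else ""
            print(f"{command[0]}: exit status {status}: {first_line}")
```

A non-zero status for bowtie2 alone would suggest a problem with that installation rather than with mgefinder or the input SAM file.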

No termini found in the input file...

I'm running MGEfinder on my own data and have used the pipeline described in the readme document. The inputs are a complete assembly called "Ancestor.fna", a sample assembly produced with Unicycler called "48con5.fna", and the bam and bam.bai files made using bwa mem followed by mgefinder formatbam. When I run it I get the error message:

Parsing inferseq files
Combining the inferseq files...
Loading file 1/3: workdir/01.mgefinder/Ancestor/48con5/03.inferseq_assembly.48con5.Ancestor.tsv
Loading file 2/3: workdir/01.mgefinder/Ancestor/48con5/03.inferseq_reference.48con5.Ancestor.tsv
Loading file 3/3: workdir/01.mgefinder/Ancestor/48con5/03.inferseq_overlap.48con5.Ancestor.tsv
Deleting old database directory...
No termini found in the input file...
Waiting at most 5 seconds for missing files.
Error in job make_database while creating output files workdir/02.database/Ancestor/Ancestor.database.fna, workdir/02.database/Ancestor/Ancestor.database.fna.1.bt2.
MissingOutputException in line 192 of /users/steg500/.conda/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Missing files after 5 seconds:
workdir/02.database/Ancestor/Ancestor.database.fna
workdir/02.database/Ancestor/Ancestor.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

I've tried increasing the latency wait time but it doesn't recognise the --latency-wait command when I run it with mgefinder workflow denovo. Do you have any ideas how I could fix this? Thank you!
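In the log above, "No termini found in the input file..." is printed right after the three inferseq files are loaded and before the missing-output error. One reading is that the inferseq files contain no inferred sequences, so no database is written and the `MissingOutputException` is a consequence rather than a filesystem latency problem. A quick way to test that reading is to count the data rows in each file. The sketch below assumes the files are tab-separated with a single header line, which is an assumption about the format rather than something stated on this page.

```python
import csv
import io


def count_data_rows(handle):
    """Count non-empty rows after the header line of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return 0  # completely empty file
    return sum(1 for row in reader if row)


if __name__ == "__main__":
    # Toy example; for a real file use: open(path, newline="")
    example = io.StringIO("sample\tinferred_seq\nA\tACGT\nB\tGGCC\n")
    print(count_data_rows(example))
```

If all three inferseq files report zero rows, the place to look is the earlier find and pair steps rather than the make_database rule.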

Workflow without raw short reads

Hi,

Upon recommendation from another group (Dr. Pamer's group at UChicago) I was interested in using MGEfinder to detect MGEs for a family of microbes. However, I ran into a couple of issues that I don't really know how to fix/where to start and I was hoping that you might be able to advise.

I started by running the tutorial to get a feel for the software and make sure it was running smoothly, and I was able to run the tutorial without any major problems. However, when I tried to run the program with my own files, I ran into some trouble.

I think that the main problem might be that I am trying to analyze pre-assembled reads that are already in .fna format.

Below are the steps I followed, and where I ran into problems:

Since I was working with pre-assembled contigs, I began with the alignment step and ran the following commands. (My pre-assembled contig file was named MSK_4_13_contigs.fna and the reference genome was GCF_000373885.fna, Blautia producta downloaded from NCBI.)

% bwa index GCF_000373885.fna
% bwa mem GCF_000373885.fna MSK_4_13_contigs.fna > MSK_4_13_contigs.GCF_000373885.sam

Following the tutorial process, I then entered the mgefinder environment and ran

% mgefinder formatbam MSK_4_13_contigs.GCF_000373885.sam MSK_4_13_contigs.GCF_000373885.bam --single-end

I used the --single-end flag since I wasn't using paired reads, and I received the output:

Removing secondary alignments...
Successfully removed secondary alignments...

Sorting the BAM file by chromosomal location...
BAM file successfully sorted...

Index the sorted BAM file...
BAM file successfully indexed...

I then assembled my working directory as follows:

test_workdir0/
    ├── 00.assembly/
    │   ├── MSK_4_13_contigs.fna
    ├── 00.bam/
    │   ├── MSK_4_13_contigs.GCF_000373885.bam
    │   ├── MSK_4_13_contigs.GCF_000373885.bam.bai
    └── 00.genome/
        └── GCF_000373885.fna

I then tried to run the workflow command in the mgefinder environment as follows

% mgefinder workflow --cores 4 --memory 50000 test_workdir0/

And at the makedatabase step, the following message appears:

#### CHECKING DEPENDENCIES ####
Current version of snakemake: 3.13.3
Expected version of snakemake: 3.13.3
Current version of einverted: EMBOSS:6.6.0.0
Expected version of einverted: EMBOSS:6.6.0.0
Current version of bowtie2: 2.3.5
Expected version of bowtie2: 2.3.5
Current version of samtools: 1.9
Expected version of samtools: 1.9
Current version of cd-hit: 4.8.1
Expected version of cd-hit: 4.8.1
###############################
#### PARAMETERS ####
command: makedatabase
inferseqfiles: ('test_workdir2/01.mgefinder/GCF_000373885/GCF_000373885.all_inferseq.txt',)
minimum_size: 30
maximum_size: 200000
threads: 1
memory: 16000
force: True
output_dir: test_workdir2/02.database/GCF_000373885
prefix: GCF_000373885.database
####################
Parsing inferseq files
Combining the inferseq files...
Loading file 1/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_assembly.MSK_4_13_contigs.GCF_000373885.tsv
Loading file 2/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_reference.MSK_4_13_contigs.GCF_000373885.tsv
Loading file 3/3: test_workdir2/01.mgefinder/GCF_000373885/MSK_4_13_contigs/03.inferseq_overlap.MSK_4_13_contigs.GCF_000373885.tsv
Deleting old database directory...
No termini found in the input file...
Waiting at most 5 seconds for missing files.
Error in job make_database while creating output files test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna, test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2.
MissingOutputException in line 186 of /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile:
Missing files after 5 seconds:
test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna
test_workdir2/02.database/GCF_000373885/GCF_000373885.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 50, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow
    shell(cmd)
  File "/Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=test_workdir2 memory=16000 --cores 1 --configfile /Users/arnoldj/opt/anaconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml' returned non-zero exit status 1.

When I open the files in 01.mgefinder/<genome>/, all the .tsv files are empty.
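
Empty intermediate files like these can be listed before the database step is attempted. This is a minimal sketch, assuming that a .tsv file counts as empty when it holds nothing beyond a single header line; the helper names are hypothetical and are not MGEfinder functions.

```python
from pathlib import Path


def tsv_has_data_rows(path):
    """Return True if the file has at least one non-blank line after the first.

    Assumption: the first non-blank line, if any, is a header.
    """
    with open(path) as handle:
        rows = [line for line in handle if line.strip()]
    return len(rows) > 1


def find_empty_tsvs(results_dir):
    """Return sorted paths of .tsv files under results_dir with no data rows."""
    return sorted(
        str(path)
        for path in Path(results_dir).rglob("*.tsv")
        if not tsv_has_data_rows(path)
    )
```

For example, `find_empty_tsvs("test_workdir2/01.mgefinder")` would return every per-sample table that contains no rows.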

Is there something that I am doing incorrectly, or is there another process that I should follow since my contigs are already pre-assembled?

I would love to be able to use MGEfinder, so any help would be very much appreciated! Thank you in advance!

Best regards,
Jack

Can I use MGEfinder on Archaea data?

Request for additional information on question #30 regarding "reference"

In the response to question #30 (also referenced in the response to question #33) you mention that "'--filter-clusters-inferred-assembly'"... "removes clusters that were never identified from an assembly, meaning they were only found in the reference."

Does the term "reference" in the question #30 response refer only to the genome defined as the reference when the pipeline was originally run? Or can "reference" there refer to any single-genome-only cluster (i.e. any cluster originally identified in only a single genome)? I want to be sure I'm understanding this correctly. In my analysis, I'm using only assembled genomes (albeit mostly draft assemblies), and I'm seeking clarification on whether self-only clusters (i.e. clusters originally identified in only a single genome) would be removed under the same rules that apply to the explicitly defined reference used when the pipeline was run. Under these conditions, every genome assembly in turn might be construed as a "reference" for the purposes of filtering as explained in the response to question #30.

I have a problem when I use Method 1 to install MGEfinder!

Hello, thanks for the wonderful tool you have developed for exploring MGEs!
I failed to install MGEfinder using Method 1. It shows:

(base) lkj666@Cool:~/software/MGEfinder$ bash install.sh 
Removing mgefinder environment if already installed...
Installing mgefinder environment...
Collecting package metadata (repodata.json): \ install.sh: line 13: 727048 killed               conda env create -f env/conda_linux64.yaml
Installation Complete.

Before running mgefinder, activate the mgefinder environment with
> conda activate mgefinder

You can then run mgefinder by typing
> mgefinder [command]
Send any questions to [email protected]

Can you give me some advice? Thank you.

Sensitivity to detect IS elements

Dear durrantmm

I am using MGEfinder to detect IS6110 in Mycobacterium tuberculosis, which has long been used for DNA fingerprinting of M.tb isolates. I have finished the MGEfinder analysis for hundreds of M.tb isolates, and I am now having some difficulty interpreting the results.

1. Though nearly all M.tb strains should harbor IS6110, MGEfinder detects no IS6110 insertion in 5% of my tested strains. All of them belong to lineages 1 and 4.

2. I noticed that MGEfinder often detects fewer IS6110 copies than another reliable IS-finding tool, implying that MGEfinder lacks sensitivity under my conditions.

3. Related to 1 and 2, let me confirm one point. M.tb has hot-spot regions where IS6110 is frequently inserted at identical positions. If IS6110 insertion points are shared between the reference genome sequence and the query strains, can MGEfinder detect those shared IS6110 insertions?

I know you analyzed M.tb data in your published paper, and I would be grateful if you could give me any suggestions to overcome these difficulties. For example, are there any recommended parameters to improve sensitivity? The size of IS6110 is about 1300 bp and I want to focus on IS6110 insertions this time.

I always appreciate your kind support.
Many thanks.

job pair error

Hi Matthew,

This is me again~ Hope everything is going well with you! Finally, we have some meaningful data to run through your software, but it is giving me a job pair error message.

I set up my folders as follows:

├── 00.assembly
│   ├── ST1_19.fna
│   ├── ST1_20.fna
│   └── ST1_6.fna
├── 00.bam
│   ├── ST1_19.ST1_12.bam
│   ├── ST1_19.ST1_12.bam.bai
│   ├── ST1_20.ST1_12.bam
│   ├── ST1_20.ST1_12.bam.bai
│   ├── ST1_6.ST1_12.bam
│   └── ST1_6.ST1_12.bam.bai
└── 00.genome
    └── ST1_12.fna
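
A layout like the one above can be checked mechanically before starting the workflow. This is a minimal sketch, assuming BAM files are named `<sample>.<genome>.bam` with no extra dots in either name, as in the listing; the function name and messages are hypothetical and not part of MGEfinder.

```python
from pathlib import Path


def check_workdir(workdir):
    """Return a list of problems found in an MGEfinder-style working directory.

    Assumption: each BAM is named <sample>.<genome>.bam, with a matching
    00.assembly/<sample>.fna, 00.genome/<genome>.fna and a .bam.bai index.
    """
    root = Path(workdir)
    problems = []
    for bam in sorted((root / "00.bam").glob("*.bam")):
        parts = bam.name.split(".")
        if len(parts) != 3:
            problems.append(f"{bam.name}: expected <sample>.<genome>.bam")
            continue
        sample, genome, _ = parts
        if not (root / "00.assembly" / f"{sample}.fna").is_file():
            problems.append(f"{bam.name}: missing 00.assembly/{sample}.fna")
        if not (root / "00.genome" / f"{genome}.fna").is_file():
            problems.append(f"{bam.name}: missing 00.genome/{genome}.fna")
        if not Path(str(bam) + ".bai").is_file():
            problems.append(f"{bam.name}: missing index {bam.name}.bai")
    return problems
```

An empty list means the file names are at least consistent with each other, which would point away from naming as the cause of a failure.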

The error message is as follows. I hope it is not about how I named the isolates.

Error in job pair while creating output file workdir/01.mgefinder/ST1_12/ST1_6/02.pair.ST1_6.ST1_12.tsv.
RuleException:
CalledProcessError in line 106 of /home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile:
Command '
        mgefinder pair -maxdr 20 -minq 20 -minial 21 -maxjsp 0.15         -lins 30 workdir/01.mgefinder/ST1_12/ST1_6/01.find.ST1_6.ST1_12.tsv workdir/00.bam/ST1_6.ST1_12.bam workdir/00.genome/ST1_12.fna -o workdir/01.mgefinder/ST1_12/ST1_6/02.pair.ST1_6.ST1_12.tsv &> workdir/01.mgefinder/ST1_12/ST1_6/log/ST1_6.ST1_12.pair.log
        ' returned non-zero exit status 1.
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile", line 106, in __rule_pair
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Finished job 28.
6 of 31 steps (19%) done
Will exit after finishing currently running jobs.
Finished job 29.
7 of 31 steps (23%) done
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/home/dfi_user/miniconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 47, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 19, in _workflow
    shell(cmd)
  File "/home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=workdir/ memory=16000 --cores 4 --configfile /home/dfi_user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml' returned non-zero exit status 1.

I hope those are helpful!
Thanks in advance. Let me know if you want a copy of the files to reproduce the error message.

Eddi

Issue with pandas dependency

Hi Matthew,

I tried installing MGEfinder on Stanford's Sherlock through conda and it seemed to finish successfully. However, the help command mgefinder --help results in the following error:

ModuleNotFoundError: No module named 'pandas.compat'

I also tried installing a more up-to-date version of pandas, but it still produces the same error. I figured you might be able to help me, considering that you probably used Sherlock to develop the tool. Do you know what might be going on?

Thanks!

Analysis of assembled genome

Hello,
I'm wondering if it's possible to use an assembled genome as the entry point for analysis.

Thanks,

Theo Dreher

Issue With Latency-Wait

Hi,

I am trying to use MGEfinder but I keep getting this error:

Error in job make_database while creating output files sample5dir/02.database/v583_ncbi_genome/v583_ncbi_genome.database.fna, sample5dir/02.database/v583_ncbi_genome/v583_ncbi_genome.database.fna.1.bt2.
MissingOutputException in line 192 of /Users/duerkoplab/opt/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/denovo.original.Snakefile:
Missing files after 5 seconds:
sample5dir/02.database/v583_ncbi_genome/v583_ncbi_genome.database.fna
sample5dir/02.database/v583_ncbi_genome/v583_ncbi_genome.database.fna.1.bt2
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

I can't find anything in the manual about --latency-wait. Can you please advise on what I should do?

Thanks!

Can MGEfinder detect MGEs in the presence of strain diversity?

Hi all,

I was wondering if MGEfinder can be used in samples that are not fully clonal? e.g. have you tested whether MGEfinder can detect MGEs with different levels of "purity" of the sample?

Specifically, my use case would be data from experimental evolution carried out in a natural environment, where populations of a given strain are sequenced after a given period of evolution. To be more specific, the design is similar to this, where mice (that contain a microbiota) are colonized with a bacterial clone, which is allowed to evolve within different mice for a specific period. After this period, faecal samples are plated on selective media for the clone of interest and all colonies growing on a plate are scraped and sequenced, which is where it differs from the standard MGEfinder use case.

Multiple experiments have shown that bacteria readily adapt and that we can detect SNPs, ISs, etc. However, as this environment contains multiple species, HGT is a possibility. So the data contains substantial microvariation, but the genomes constituting these populations are not as different as if we were to sample different clones from different people.

One thing I worry about is whether the SPAdes assembly step will basically lead to a consensus sequence that ignores the microvariation present in the population. Therefore, I was wondering if it is possible to use metaSPAdes for the assembly process (assuming this would allow us to keep more of that microvariation)?

PS. sorry for the rambling post

Error with test dataset

Hello!

I'm interested in using mgefinder on our datasets and followed instructions to install through conda per the guide. I downloaded and extracted the test_workdir files as instructed. I set the environment appropriately for mgefinder in conda and invoked the following command:

$ mgefinder workflow --cores 20 --memory 100000 test_workdir/

However, it appears to have crashed with the following error:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 251, in genotype
    _genotype(clusterseq, pairfiles, filter_clusters_inferred_assembly, output_file)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/genotype.py", line 37, in _genotype
    genotypes = genotyper.genotype()
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/genotype.py", line 106, in genotype
    genotypes = self.resolve_ambiguous_genotypes(genotypes)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/genotype.py", line 224, in resolve_ambiguous_genotypes
    unresolved, cluster_counts_per_site
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/genotype.py", line 322, in resolve_all_sample_comparison
    resolved = (pd.merge(unresolved, cluster_counts, how='inner', on=['contig', 'pos_5p', 'pos_3p', 'cluster']).
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 61, in merge
    validate=validate)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 555, in __init__
    self._maybe_coerce_merge_keys()
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 986, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
Error in job genotype while creating output file test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv.
RuleException:
CalledProcessError in line 286 of /home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile:
Command '
if [ "True" == "True" ]; then
mgefinder genotype --filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err || (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
else
mgefinder genotype --no-filter-clusters-inferred-assembly test_workdir/03.results/efae_GCF_900639545/01.clusterseq.efae_GCF_900639545.tsv test_workdir/01.mgefinder/efae_GCF_900639545/efae_GCF_900639545.all_pair.txt -o test_workdir/03.results/efae_GCF_900639545/02.genotype.efae_GCF_900639545.tsv 1> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log 2> test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err || (cat test_workdir/03.results/efae_GCF_900639545/log/efae_GCF_900639545.genotype.log.err; exit 1)
fi
' returned non-zero exit status 1.
File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile", line 286, in __rule_genotype
File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/mgefinder/bin/mgefinder", line 8, in <module>
    sys.exit(cli())
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/main.py", line 51, in workflow
    _workflow(workdir, snakefile, configfile, cores, memory, unlock, rerun_incomplete, keep_going)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow.py", line 26, in _workflow
    shell(cmd)
  File "/home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/snakemake/shell.py", line 88, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/Snakefile --config wd=test_workdir/ memory=16000 --cores 20 --configfile /home/user/miniconda3/envs/mgefinder/lib/python3.6/site-packages/mgefinder/workflow/config.yml ' returned non-zero exit status 1.

Obviously, I would like to get the test dataset running properly before trying our own data. In my experience this is most likely something simple, but my relative inexperience leaves me baffled at this time.
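
In the traceback above, the pandas frames come from /home/user/.local/lib/python3.6/site-packages while the mgefinder frames come from the conda environment, which may mean a user-level pandas install is shadowing the version the environment provides. This is a minimal sketch for checking where a module is loaded from; the helper names are hypothetical, and whether this explains the ValueError is an assumption rather than a confirmed diagnosis.

```python
import importlib.util
import os
import sys


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def is_inside_prefix(path, prefix=sys.prefix):
    """Return True if path lies under prefix (by default, the active environment)."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix
```

Running `module_origin("pandas")` inside the activated environment and passing the result to `is_inside_prefix` shows whether the import resolves within the environment. If it does not, Python's PYTHONNOUSERSITE environment variable, which tells the interpreter to skip the user site-packages directory, is one thing to try.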

Any suggestions are most welcome.

Tony
