
megalodon's Introduction


We have new bioinformatic resources that replace the functionality of this project! For production modified base calling see the Dorado repository. For modified base data preparation, model training and experimental model testing see the Remora repository. For post-processing and analysis of modified base calling results see the modkit repository.

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: [email protected] for help with your application if it is not possible to upgrade.


Megalodon


Megalodon is a research command line tool that extracts high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Raw nanopore reads are processed by a single command to produce basecalls (FASTA/Q), reference mappings (SAM/BAM/CRAM), modified base calls (per-read and bedgraph/bedmethyl/modVCF), sequence variant calls (per-read and VCF) and more.

Detailed documentation for all megalodon commands and algorithms can be found on the Megalodon documentation page.

Prerequisites

The primary Megalodon run mode requires the Guppy basecaller (version >= 4.0). See the community page for download/installation instructions [login required].

Megalodon is a Python-based command line software package. Given a Python (version >= 3.6) installation, all other requirements are handled by pip or conda.

Taiyaki is no longer required to run Megalodon, but installation is required for two specific run modes:

  1. output mapped signal files (for basecall model training)
  2. running the Taiyaki basecalling backend (for neural network designs including experimental layers)
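
For illustration, a mapped-signal run might look like the sketch below. The --taiyaki-model-filename flag and checkpoint path are assumptions here; verify the exact flag names for your version with megalodon --help-long.

# Hedged sketch: produce mapped signal files for basecall model training
# (requires a Taiyaki installation; the --taiyaki-model-filename flag and
# checkpoint path are assumptions; check `megalodon --help-long`)
megalodon \
    raw_fast5s/ \
    --outputs signal_mappings \
    --reference reference.fa \
    --taiyaki-model-filename taiyaki_model.checkpoint \
    --processes 20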

The ont-pyguppy-client-api dependency is only distributed for select platforms, so compatible Python versions and operating systems may be limited.

Installation

pip is recommended for Megalodon installation.

pip install megalodon

conda installation is available, but not fully supported. ont_pyguppy_client_lib is not available on conda and thus must be installed with pip.

conda install megalodon
pip install ont_pyguppy_client_lib

To install from GitHub source for development, run the following commands.

git clone https://github.com/nanoporetech/megalodon
pip install -e megalodon/

Getting Started

Megalodon must obtain the intermediate output from the basecall neural network. Guppy (the production nanopore basecalling software) is the recommended backend for obtaining this output from raw nanopore signal (FAST5 files). Nanopore basecalling is compute-intensive, so specifying GPU resources (--devices) is highly recommended for optimal Megalodon performance.

Megalodon is accessed via the megalodon command line interface.

# megalodon help (common args)
megalodon -h
# megalodon help (all args)
megalodon --help-long

# Example command to output basecalls, mappings, and CpG 5mC and 5hmC methylation in both per-read (``mod_mappings``) and aggregated (``mods``) formats
#   Compute settings: GPU devices 0 and 1 with 20 CPU cores
megalodon \
    raw_fast5s/ \
    --guppy-config dna_r9.4.1_450bps_fast.cfg \
    --remora-modified-bases dna_r9.4.1_e8 fast 0.0.0 5hmc_5mc CG 0 \
    --outputs basecalls mappings mod_mappings mods \
    --reference reference.fa \
    --devices 0 1 \
    --processes 20

The above command uses the modified base model included in Remora. For more details on Remora modified base settings see the Remora repository.

This command produces the megalodon_results output directory containing all requested output files and logs. The format for common outputs is described briefly below and in more detail in the full documentation.
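
As a rough orientation, the results directory for the command above might contain files along these lines; this listing is illustrative only, and exact names depend on the requested outputs:

# Illustrative listing; exact contents depend on the requested --outputs
ls megalodon_results/
#   log.txt                            run log
#   basecalls.fastq                    basecalls
#   mappings.bam                       reference mappings
#   mod_mappings.bam                   per-read modified base calls (BAM tags)
#   per_read_modified_base_calls.db    per-read modified base scores (SQLite)
#   modified_bases.5mC.bed             aggregated bedmethyl output
#   modified_bases.5hmC.bed            aggregated bedmethyl output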

The code below shows how to obtain and run the R9.4.1, MinION/GridION, 5mC CpG model (same model shipped with Guppy as of 4.5.2 release).

# Obtain and run R9.4.1, MinION, 5mC CpG model from Rerio
git clone https://github.com/nanoporetech/rerio
rerio/download_model.py rerio/basecall_models/res_dna_r941_min_modbases_5mC_CpG_v001
megalodon \
    raw_fast5s/ \
    --guppy-params "-d ./rerio/basecall_models/" \
    --guppy-config res_dna_r941_min_modbases_5mC_CpG_v001.cfg \
    --outputs basecalls mappings mod_mappings mods \
    --reference reference.fa \
    --mod-motif m CG 0 \
    --devices 0 1 \
    --processes 20

The path to the guppy_basecall_server executable is required to run Megalodon. By default, Megalodon assumes Guppy (Linux GPU) is installed in the current working directory (i.e. ./ont-guppy/bin/guppy_basecall_server). Use the --guppy-server-path argument to specify a different path.
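
For example, with Guppy installed system-wide (the /opt path below is a placeholder):

# Point Megalodon at a non-default Guppy installation
megalodon \
    raw_fast5s/ \
    --guppy-server-path /opt/ont-guppy/bin/guppy_basecall_server \
    --guppy-config dna_r9.4.1_450bps_fast.cfg \
    --outputs basecalls mappings \
    --reference reference.fa \
    --devices 0 \
    --processes 20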

Inputs

  • Raw reads
    • Directory containing raw read FAST5 files (sub-directories recursively searched)
  • Reference
    • Genome or transcriptome sequence reference (FASTA or minimap2 index)
  • Variants File
    • Megalodon requires a set of candidate variants for --outputs variants (provide via --variant-filename argument; VCF or BCF).
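
As an example, a variant-calling run could be invoked as sketched below; candidate_variants.vcf.gz is a placeholder filename:

# Sketch: variant calling against a candidate variant set
megalodon \
    raw_fast5s/ \
    --outputs mappings variants \
    --reference reference.fa \
    --variant-filename candidate_variants.vcf.gz \
    --devices 0 \
    --processes 20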

Outputs

All Megalodon outputs are written into the directory specified with the --output-directory option with standard file names and extensions.

  • Basecalls
    • Format: FASTQ (default) or FASTA
    • Basecall-anchored modified base scores are also available in hts-spec BAM format tags (--outputs mod_basecalls).
  • Mappings
    • Format: SAM, BAM (default), or CRAM
    • A tab-separated mapping text summary is also produced including per-read alignment statistics.
  • Modified Base Calls
    • The basecalling model specifies the modified bases capable of being output. See megalodon_extras modified_bases describe_alphabet.
    • Per-read modified base calls
      • SQL DB containing per-read modified base scores at each covered reference location
      • Reference-anchored per-read modified base calls are also available in BAM format via the Mm and Ml tags (see the hts-specs specifications).
    • Aggregated calls
      • Format: bedgraph, bedmethyl (default), and/or modVCF
    • To restrict modified base calls to specific motif(s), specify the --mod-motif argument. For example, to restrict calls to CpG sites, specify --mod-motif Z CG 0.
  • Sequence Variant Calls
    • Per-read Variant Calls
      • SQL DB containing per-read variant scores for each covered variant
    • Aggregated calls
      • Format: VCF
      • Default run mode is diploid. To run in haploid mode, set the --haploid flag (see the sketch after this list).
      • For best results on a diploid genome see the variant phasing workflow on the full documentation page.
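
For instance, a haploid variant-calling run might look like the sketch below (candidate_variants.vcf.gz is again a placeholder):

# Sketch: haploid (e.g. bacterial) variant calling
megalodon \
    raw_fast5s/ \
    --outputs mappings variants \
    --reference reference.fa \
    --variant-filename candidate_variants.vcf.gz \
    --haploid \
    --devices 0 \
    --processes 20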

Live Processing

Megalodon supports live run processing. Activate live processing mode by adding the --live-processing argument and specifying the MinKNOW output directory as the Megalodon FAST5 input directory. Megalodon will continue to search for FAST5s until MinKNOW creates the final_summary* file, indicating that data production has completed.
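
A live run might therefore be launched as in the sketch below, where the MinKNOW output path is a placeholder:

# Sketch: process a MinKNOW run as it is being written
megalodon \
    /data/minknow/my_run/fast5/ \
    --live-processing \
    --outputs basecalls mappings mods \
    --reference reference.fa \
    --devices 0 \
    --processes 20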

Guppy Models and Parameters

The basecalling model defines the modified bases capable of being output by Megalodon. Basecalling models must be trained to specifically detect a type or types of modified bases. See the Megalodon documentation here for instructions to construct modified base training data and train a new modified base model.

By default, Megalodon uses the dna_r9.4.1_450bps_modbases_5mc_hac.cfg Guppy config (released in version 4.5.2). This config is compatible with DNA, R9.4.1, MinION/GridION reads and allows output of 5mC calls in all contexts. Use the --guppy-config option to specify a different guppy model config. The appropriate Rerio model is recommended for the highest accuracy modified base calls.

All configs can be used to output basecalls and mappings (as well as signal_mappings and per_read_refs for basecall training). Modified base and sequence variant outputs require Megalodon calibration files. To list configs with default calibration files, run megalodon --list-supported-guppy-configs. See calibration documentation here for details on Megalodon model calibration.
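
For example:

# List Guppy configs that ship with default Megalodon calibration files
megalodon --list-supported-guppy-configs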

Only flip-flop configs/models are currently supported by Megalodon (this excludes k-mer based and RLE model types).

In addition to the --guppy-config and --guppy-server-path options, a number of additional arguments control the behavior of the Guppy backend. The --guppy-params argument passes arguments directly to the guppy_basecall_server initialization call. For example, to optimize GPU usage, the following option might be specified: --guppy-params "--num_callers 5 --ipc_threads 6".

Finally, the --guppy-timeout argument ensures that a run will not stall on a few reads or with lower compute resources. A "Guppy server unable to receive read" error indicates that the Guppy server is overwhelmed. Consider lowering the --processes and/or --guppy-concurrent-reads values to reduce these errors. Finding the right balance for these parameters can help achieve optimal performance on a system.
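
As a sketch, a run on a resource-constrained system might relax the timeout and lower concurrency; the values below are illustrative starting points, not recommendations:

# Illustrative settings for a resource-constrained system
megalodon \
    raw_fast5s/ \
    --guppy-timeout 120 \
    --guppy-concurrent-reads 2 \
    --outputs basecalls mappings \
    --reference reference.fa \
    --processes 8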

Disk Performance Considerations

The status of the extract signal input queue and output queues is displayed by default (suppress with --suppress-queues-status).

If the extract_signal input queue is often empty, Megalodon is waiting on raw signal to be read from FAST5 files. If the input queue remains empty, increasing the --num-read-enumeration-threads and/or --num-extract-signal-processes parameters (defaults 8 and 2) may improve performance. Note that --num-read-enumeration-threads threads are opened within each extract signal process. Alternatively, if available, move the input FAST5s to a disk with faster I/O.
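
For example (the values here are illustrative only):

# Widen the signal-extraction stage for slow input disks
megalodon \
    raw_fast5s/ \
    --num-read-enumeration-threads 16 \
    --num-extract-signal-processes 4 \
    --outputs basecalls mappings \
    --reference reference.fa \
    --devices 0 \
    --processes 20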

If any output status bar indicates a full queue, Megalodon will stall waiting on that process to write data to disk. Moving the --output-directory to a location with faster disk I/O should improve performance. Per-read modified base and variant statistics are stored in an on-disk sqlite database, whose performance can depend heavily on disk speed and configuration.

High Quality Phased Variant Calls

In order to obtain the highest quality diploid sequence variant calls, the full variant phasing pipeline employing whatshap should be applied. This pipeline is described in detail on the full documentation page. The default diploid variant settings are optimized for the full phasing pipeline and not the highest quality diploid calls directly from a single Megalodon call.

High-Density Variants

When running Megalodon with a high density of variants (more than 1 variant per 100 reference bases), certain steps can be taken to increase performance. See variant atomize documentation for further details.

RNA

Megalodon supports processing direct RNA nanopore data. To process an RNA sample, specify the --rna flag as well as an RNA model via the --guppy-config argument.

Megalodon performs mapping using the standard minimap2 option, map-ont, and not the splice option, so a transcriptome reference must be provided. The Megalodon code supports RNA modified base detection, but currently no RNA modified base basecalling models are released.
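
A direct RNA invocation might look like the sketch below; the RNA Guppy config name is an assumption, so check the configs bundled with your Guppy installation:

# Sketch: direct RNA processing against a transcriptome reference
# (the RNA config name is an assumption; check your Guppy data directory)
megalodon \
    raw_rna_fast5s/ \
    --rna \
    --guppy-config rna_r9.4.1_70bps_hac.cfg \
    --outputs basecalls mappings \
    --reference transcriptome.fa \
    --devices 0 \
    --processes 20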

Megalodon does not currently check that a set of reads agrees with the provided model or specified options (e.g. --rna). Users should take care to ensure that the correct options are specified for each sample processed.

License and Copyright

© 2019-21 Oxford Nanopore Technologies Ltd.

Megalodon is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0. If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However, much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resources for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

megalodon's People

Contributors

colindaven, lexming, marcus1487, tmassingham-ont


megalodon's Issues

Aggregation takes longer than normal

I was running megalodon with the output directory located on a local disk. However, the aggregation step seems to take forever to run. I ran megalodon successfully before in the same environment, and aggregation was much, much faster than this run. Any idea how to solve this?

Model for variant calls

Awesome work on this. Quick question: which model is preferable for the variant calling workflow on haploid (bacterial) genomes, the methylation-aware one or the standard HAC?

megalodon basemod TypeError

Greetings,

I am testing megalodon for base mod calling only on an E. coli sample. Numpy and cython were previously installed, and I ran the following to install megalodon:

pip install git+https://github.com/nanoporetech/megalodon.git

This is the command of my test run:

megalodon /path/to/guppy/fast5/workspace/ --outputs mods --reference /path/to/ecoli.genome.fasta --processes 8

The job died immediately. The stderr for the job is:

Traceback (most recent call last):
  File "/home/etvedte/.local/bin/megalodon", line 8, in <module>
    sys.exit(_main())
  File "/home/etvedte/.local/lib/python3.5/site-packages/megalodon/megalodon.py", line 1118, in _main
    max_concur_chunks=args.max_concurrent_chunks)
  File "/home/etvedte/.local/lib/python3.5/site-packages/megalodon/backends.py", line 234, in __init__
    self._load_fast5_post_out()
  File "/home/etvedte/.local/lib/python3.5/site-packages/megalodon/backends.py", line 162, in _load_fast5_post_out
    self.output_size) = get_model_info_from_fast5(read)
  File "/home/etvedte/.local/lib/python3.5/site-packages/megalodon/backends.py", line 144, in get_model_info_from_fast5
    out_size = fast5_io.get_posteriors(read).shape[1]
  File "/home/etvedte/.local/lib/python3.5/site-packages/megalodon/fast5_io.py", line 61, in get_posteriors
    latest_basecall + '/BaseCalled_template', 'StateData', proxy=False)
TypeError: get_analysis_dataset() got an unexpected keyword argument 'proxy'

Question: Is it possible to merge intermediate files and continue calling methylation?

Hello,

I was wondering if I can speed up megalodon by running it on different cluster nodes with batches of FAST5s and then merging the intermediate files (for example per_read_modified_base_calls.db or the per-read modified bases text file), which would then be fed to megalodon to call the final modifications. I checked the megalodon/scripts/run_aggregation.py code but couldn't figure out what it takes as input. Can you give some advice on this?

Poor modified base results observed for PromethION data

Can you point me to any benchmark results for the DNA methylation detection models in guppy and megalodon, such as the correlation of methylation frequency between bisulfite sequencing and megalodon? I cannot find such material on the nanopore community or github.
Thank you so much for your kind help!

Error in Aggregation

Hi
I was running Megalodon on a set of reads but once it passed the indexing step I got the following error:
[16:39:39] Waiting for mods database to complete indexing.
Traceback (most recent call last):
  File "/home/group/.local/bin/megalodon", line 11, in <module>
    load_entry_point('megalodon==2.1.0', 'console_scripts', 'megalodon')()
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/__main__.py", line 361, in _main
    megalodon._main(args)
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/megalodon.py", line 1105, in _main
    mods_info.mod_output_fmts, args.suppress_progress_bars)
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/aggregate.py", line 302, in aggregate_stats
    agg_mods = mods.AggMods(mods_db_fn, no_indices_in_mem=True)
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/mods.py", line 2040, in __init__
    self.mods_db = ModsDb(mods_db_fn)
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/mods.py", line 149, in __init__
    self.establish_db_conn()
  File "/home/group/.local/lib/python3.6/site-packages/megalodon/mods.py", line 197, in establish_db_conn
    self.db.execute('PRAGMA cache_size = {}'.format(-mh.SQLITE_CACHE_SIZE))
sqlite3.OperationalError: attempt to write a readonly database

This seems odd that it was able to write to the database in the previous step but then encountered an error. I then manually changed the permissions to allow all read and write and tried to run 'megalodon_extras aggregate run --outputs mods' and still get the same error.

Any advice on this would be greatly appreciated.

possible bug in help output?

Thanks for the interesting tool!

I think there is a "not" missing here:

but --variant-filename provided. 
should be
but --variant-filename not provided. 


cat slurm-2909915.out
Input dir:  /ngsssd1/rcug/ONT_thermo_2019/20190620_1055_MN32001_FAK85862_bf499a51/fast5
[09:41:08] Using canonical alphabet ACGT and modified bases Y=6mA Z=5mC.
[09:41:08] Loading reference.
****************************************************************************************************
        ERROR: per_read_snps output requested, but --variant-filename provided. 
****************************************************************************************************

Modification calling without guppy or taiyaki?

Hi Marcus,

I have a very naive question. I followed the instructions and created a conda env for megalodon, and I was happy to finally try it out personally. I am mainly interested in mods calling.

I got this error :

******************** WARNING: Quality score computation not enabled for taiyaki or FAST5 backends. Quality scores will be invalid. ********************
[14:39:30] Loading FAST5 basecalling backend.
Traceback (most recent call last):
  File "/home/pterzian/.conda/envs/megalodonenv/bin/megalodon", line 11, in <module>
    sys.exit(_main())
  File "/home/pterzian/.conda/envs/megalodonenv/lib/python3.6/site-packages/megalodon/megalodon.py", line 1212, in _main
    with backends.ModelInfo(backend_params, args.processes) as model_info:
  File "/home/pterzian/.conda/envs/megalodonenv/lib/python3.6/site-packages/megalodon/backends.py", line 407, in __init__
    self._load_fast5_post_out()
  File "/home/pterzian/.conda/envs/megalodonenv/lib/python3.6/site-packages/megalodon/backends.py", line 285, in _load_fast5_post_out
    self.output_size) = get_model_info_from_fast5(read)
  File "/home/pterzian/.conda/envs/megalodonenv/lib/python3.6/site-packages/megalodon/backends.py", line 266, in get_model_info_from_fast5
    out_size = fast5_io.get_posteriors(read).shape[1]
  File "/home/pterzian/.conda/envs/megalodonenv/lib/python3.6/site-packages/megalodon/fast5_io.py", line 76, in get_posteriors
    '--post_out were set when running guppy.')
megalodon.megalodon_helper.MegaError: StateData not found in FAST5 file. Ensure --fsat5_out and --post_out were set when running guppy.

When running this command :

megalodon cp_fast5/ \
    --outputs basecalls mods \
    --reference reference.fa \
    --mod-motif Z CG 0 --devices 0 --processes 10 \
    --verbose-read-progress 3 \
    --do-not-use-guppy-server \
    --overwrite

In the documentation I read "The path to the guppy_basecall_server executable is required to run megalodon" but I found this argument --do-not-use-guppy-server in the full documentation (and naively hoped it would help).

So I guess I am missing something! Maybe it is my environment or the raw FAST5 file format?

Best,

Paul

Some confusion about the aggregated modification call results

Hi,
@marcus1487 I'm glad to experience the amazing new tool you developed, bravo!
I have some questions about the column definitions in the output file, which follows the bedMethyl format.

line 5: Score from 0-1000. Capped number of reads

What exactly does this score mean? A methylation score? Regularized depth?

line: Percentage of reads that show methylation at this position in the genome

Does this percentage refer to num_mod_reads/num_coverage at the specified site?
And are both of these numerical indicators used to measure the reliability and degree of the modification?
Thanks again for your help!

pyguppyclient raised Exception("Bad request")

Hi there,

I have encountered some difficulties while using megalodon with a rerio model for DRS reads. I am not sure if I have done something wrong. Thank you so much!

Software Version
Guppy Basecall Service Software, (C) Oxford Nanopore Technologies, Limited. Version 4.0.14+8d3226e, client-server API version 2.1.0
Megalodon version: 2.1.1

Command used
megalodon /exeh_2/sin/data/nanopore/20200601_1610_X1_FAL11238_37c3c671/fast5_pass/FAL11238_pass_acd71f2d_21.fast5 --rna --guppy-server-path /exeh_2/sin/bin/ont-guppy-cpu/bin/guppy_basecall_server --guppy-params "-d /exeh_2/sin/bin/rerio/basecall_models/" --guppy-config res_rna2_r941_min_flipflop_v001.cfg --processes 20 --outputs mod_mappings --mappings-format bam --reference /exeh_2/sin/references/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

Error raised
[21:47:24] Loading guppy basecalling backend.
Traceback (most recent call last):
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/bin/megalodon", line 11, in <module>
    sys.exit(_main())
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/megalodon/__main__.py", line 361, in _main
    megalodon._main(args)
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/megalodon/megalodon.py", line 1071, in _main
    with backends.ModelInfo(backend_params, args.processes) as model_info:
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/megalodon/backends.py", line 418, in __init__
    self._load_pyguppy()
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/megalodon/backends.py", line 409, in _load_pyguppy
    set_pyguppy_model_attributes()
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/megalodon/backends.py", line 371, in set_pyguppy_model_attributes
    init_client.connect()
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/pyguppyclient/client.py", line 73, in connect
    config = self._load_config(self.config_name)
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/pyguppyclient/client.py", line 95, in _load_config
    loaded_configs = {Config(c).name: Config(c) for c in self.get_configs()}
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/pyguppyclient/client.py", line 91, in get_configs
    res = self.send(SimpleRequestType.GET_CONFIGS)
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/pyguppyclient/client.py", line 55, in send
    return simple_response(self.recv())
  File "/exeh_2/sin/bin/miniconda3/envs/megalodon/lib/python3.7/site-packages/pyguppyclient/ipc.py", line 103, in simple_response
    raise Exception("Bad request")
Exception: Bad request

Restarting aggregation using run_aggregation.py

Hi,

I am trying to detect methylation in our reads, but the node crashed after all the reads were analysed and output to the .db, yet before the modified positions started being output to the .bed (I guess while the .db was indexing). Can you use the run_aggregation.py script to restart the modified base output in this case?

I am trying to run the script using:
python ~/megalodon/scripts/run_aggregation.py --megalodon-directory ~/Documents/Megalodon/megalodon_results --output-suffix recalled --outputs mods --processes 7

but get the following error:

Traceback (most recent call last):
  File "/home/group/megalodon/scripts/run_aggregation.py", line 126, in <module>
    main()
  File "/home/group/megalodon/scripts/run_aggregation.py", line 110, in main
    args.suppress_progress, valid_read_ids, args.output_suffix)
  File "/home/group/anaconda3/lib/python3.7/site-packages/megalodon/aggregate.py", line 312, in aggregate_stats
    agg_mods = mods.AggMods(mods_db_fn, load_in_mem_indices=False)
  File "/home/group/anaconda3/lib/python3.7/site-packages/megalodon/mods.py", line 1291, in __init__
    self._mod_long_names = self.mods_db.get_mod_long_names()
  File "/home/group/anaconda3/lib/python3.7/site-packages/megalodon/mods.py", line 214, in get_mod_long_names
    'SELECT mod_base, mod_long_name FROM mod_long_names').fetchall())
sqlite3.OperationalError: no such table: mod_long_names

This also happens when I try to run the command with .db files that have gone all the way through the pipeline, so I don't think it is related to the indexing.

Any help would be appreciated.

Best,

Matt

missing ./ont-guppy/bin/guppy_basecall_server

Thanks for the tool.

Just installed and ran it. Even though we re-installed guppy_basecall_server, I get the error below. I think I am missing some modules?

[11:33:37] Loading guppy basecalling backend.
Traceback (most recent call last):
File "/apps/megalodon/2.0.0/bin/megalodon", line 11, in
sys.exit(_main())
File "/apps/megalodon/2.0.0/lib/python3.7/site-packages/megalodon/megalodon.py", line 1212, in _main
with backends.ModelInfo(backend_params, args.processes) as model_info:
File "/apps/megalodon/2.0.0/lib/python3.7/site-packages/megalodon/backends.py", line 403, in init
self._load_pyguppy()
File "/apps/megalodon/2.0.0/lib/python3.7/site-packages/megalodon/backends.py", line 393, in _load_pyguppy
start_guppy_server()
File "/apps/megalodon/2.0.0/lib/python3.7/site-packages/megalodon/backends.py", line 340, in start_guppy_server
stdout=self.guppy_out_fp, stderr=self.guppy_err_fp)
File "/apps/megalodon/2.0.0/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/apps/megalodon/2.0.0/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './ont-guppy/bin/guppy_basecall_server': './ont-guppy/bin/guppy_basecall_server'
~

Guppy multiple server ports

Hi Marcus, sorry to bug you again. I noticed that it's currently not possible to run multiple forks of guppy_basecall_server conveniently within a Nextflow pipeline on the same machine.

The issue is that --guppy-server-port has to be specified manually and cannot be overwritten by e.g. --guppy-params "--port auto". It also does not accept strings as an argument, e.g. --guppy-server-port auto.

Maybe a convenient fix would be to allow passing auto to megalodon --guppy-server-port?

What DNA modifications are supported

Hi,

What DNA modifications are supported in Megalodon? Only 5mC in CpG context? What about 5hmC/5fC/5caC? What about 4mC or 6mA?

basically all bases are modified

Hi!

I was trying to call modifications on a PromethION run using this command:

megalodon \
--guppy-config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac_prom.cfg \
--guppy-server-path=/home/ubuntu/tools/ont-guppy/bin/guppy_basecall_server \
--devices 0 1 \
--processes 40 \
--guppy-params "--num_callers 12 --ipc_threads 16" \
--outputs "mappings" "mods" \
--mod-output-formats "bedmethyl" "modvcf" \
--guppy-timeout 10.0 \
--output-directory megalodon \
--mappings-format bam \
--reference $REF \
$fast5_dir

Now when I load the resulting bed files into IGV together with my reference assembly, basically all bases of my assembly appear modified. I guess this is a user mistake, so I would be happy to get suggestions on usage. Please specify any log files needed to help.

Thanks
Michael

WARNING: Error inserting modified base scores into database.

Using megalodon built from a clone of the github repository, with GPU calling. Basecalling continues when the error happens.

Command line: nohup nice megalodon fastq/workspace/ --outputs basecalls mappings per_read_mods --write-mods-text --mod-motif Z CG 0 --output-directory megalodon/ --mappings-format bam --reference /mnt/ix1/Resources/GenomeRef/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa --devices 2

In the nohup.out I get: ******************** WARNING: Error inserting modified base scores into database. See log debug output for error details. ********************

And in the log.out I see:
DBG: megalodon: 197: Successfully processed read 005b4e07-9eca-4126-b830-b4cc9eca6ee8
DBG: megalodon: 189: Analyzing read 005c5dd8-06e4-4d4f-965e-7520c42b44c3
DBG: megalodon: 197: Successfully processed read 005c5dd8-06e4-4d4f-965e-7520c42b44c3
DBG: megalodon: 189: Analyzing read 005d5a58-1224-4c2e-837d-7708ce5bfe9e
DBG: megalodon: 208: Incomplete processing for read 005d5a58-1224-4c2e-837d-7708ce5bfe9e ::: No alignment
DBG: megalodon: 189: Analyzing read 0081f81f-1c9e-4997-b17c-6eaa90e72566
DBG: megalodon: 197: Successfully processed read 0081f81f-1c9e-4997-b17c-6eaa90e72566
DBG: megalodon: 189: Analyzing read 00880d64-7463-4d6a-a7c2-2a6af97c4aca
DBG: megalodon: 197: Successfully processed read 00880d64-7463-4d6a-a7c2-2a6af97c4aca
DBG: megalodon: 189: Analyzing read 00884bde-63e5-4eb7-b496-5a24f37d2e17
DBG: mods: 538: Error inserting modified base scores into database: tuple index out of range
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/megalodon/lib/python3.6/site-packages/megalodon/mods.py", line 527, in store_mod_call
    mods_db.insert_read_scores(r_mod_scores, read_id, chrm, strand)
  File "/home/billylau/.conda/envs/megalodon/lib/python3.6/site-packages/megalodon/mods.py", line 237, in insert_read_scores
    pos_ids = self.get_pos_ids_or_insert(r_mod_scores, chrm_id, strand)
  File "/home/billylau/.conda/envs/megalodon/lib/python3.6/site-packages/megalodon/mods.py", line 165, in get_pos_ids_or_insert
    r_pos = tuple(zip(*r_mod_scores))[0]
IndexError: tuple index out of range

How to interpret basecall-anchored modification scores

Hi,

I have successfully called bases along with modifications using Megalodon.
However, I am not sure how to interpret the HDF5 data.
In the documentation, the values are described as "similar to the guppy output format described on the community page", but I could not find the corresponding guppy documentation.

Is the number something like the log10 posterior probability of the base being modified?

Thank you.

generate sequencing_summary.txt file?

Just want to know if megalodon plans to produce a sequencing_summary.txt file.
This file is used for quality control of nanopore reads by many software tools,
but it does not seem to be generated by megalodon 2.0.

sqlite3.OperationalError: database or disk is full while over 150 GB free on the drive

Hi, I've encountered another error: megalodon complained about a lack of space on the disk, however there is over 150 GB free on my SSD.
I've noticed that at times (especially toward the end of the run) megalodon uses well over 40-50 GB of RAM, although most of the time it's fine with just 3 GB. Could that be an issue? Are you creating any databases in memory?


I've noticed some indices are created in memory:
./megalodon/mods.py:87: mod_index_in_memory=True, uuid_index_in_memory=True):

Could that be the problem?
This seems to be done toward the end of the run, and some of my multi-day runs crashed toward the end; unfortunately nothing was stored in per_read_modified_base_calls.db...

I guess this is related to megalodon/megalodon_helper.py; maybe make this a parameter specified by the user? 64 GB for memory-mapped sqlite3 db access is a crazy amount of RAM. Another question: is this 64 GB per process or overall?

# allow 64GB for memory mapped sqlite file access
MEMORY_MAP_LIMIT = 64000000000

Full stack below.

[09:45:58] Using canonical alphabet ACGT and modified bases Y=6mA Z=5mC.
[09:45:58] Loading reference.
[09:46:00] Preparing workers to process reads.
[09:46:00] Processing reads.
3 most common unsuccessful read types:
    13.0% (  80465 reads) : No alignment
     0.0% (    224 reads) : Likely out of memory error.
     -----
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 619525/619525 [88:35:31<00:00,  1.94read/s, ksamp/s=187]
[02:21:32] Unsuccessful processing types:
    13.0% (  80465 reads) : No alignment
     0.0% (    224 reads) : Likely out of memory error.
[02:21:32] Waiting for mods database to complete indexing.
Process Process-5:
Traceback (most recent call last):
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/site-packages/megalodon/mods.py", line 721, in _get_mods_queue
    mods_db.create_data_covering_index()
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/site-packages/megalodon/mods.py", line 373, in create_data_covering_index
    self.cur.execute('CREATE INDEX data_cov_idx ON data(' +
sqlite3.OperationalError: database or disk is full
[02:22:22] Spawning process to sort mappings
[02:22:22] Spawning modified base aggregation processes.
[02:22:22] Aggregating 0 variants and 0 modified base sites over reads.
[02:22:22] Waiting for mappings sort

Question about modification model accuracy and compatibility

Hi,
How accurate can megalodon be when using the res_dna_r941_min_modbases_5mC_5hmC_CpG_v001 and res_dna_r941_min_modbases-all-context_v001 models to extract all-context 5mC and 5hmC methylation information? Did you test those models before releasing them? If so, can you show me your benchmark data, so that I can be confident in using those models in my research project?

As I want to extract 5hmC and 5mC methylation information at CHH and CHG motifs using PromethION data, what should I do to make the res_dna_r941_min_modbases_5mC_5hmC_CpG_v001 and res_dna_r941_min_modbases-all-context_v001 models (which were trained for MinION/GridION data and lack a PromethION version) suitable for my PromethION data?

Install issues

Hi,

I've had quite a few goes at the install and ironed out a few issues.

On my end, behind a university proxy, git+http works a lot better than git+git.

Taiyaki seems to be happier if both torch and pytest are installed prior to installation, in addition to cython and numpy.

I should note megalodon itself is the next install step after taiyaki.

I can't work out this error though; do you have suggestions? OS = Ubuntu 16.04

Thanks

TASK [Install pip taiyaki] *************************************************************************************************************************************
fatal: [172.24.8.22]: FAILED! => {"changed": false, "cmd": ["/home/rcug/.local/bin/pip3", "install", "--user", "git+http://github.com/nanoporetech/taiyaki.git"], "msg": "..."}

Key lines from the escaped pip/easy_install log embedded in the failure message (full log trimmed):

Collecting git+http://github.com/nanoporetech/taiyaki.git
  Cloning http://github.com/nanoporetech/taiyaki.git to ./pip-req-build-gx3opexs
  Running command git clone -q http://github.com/nanoporetech/taiyaki.git /tmp/pip-req-build-gx3opexs
ERROR: Complete output from command python setup.py egg_info:
  Compiling taiyaki/squiggle_match/squiggle_match.pyx because it changed.
  Compiling taiyaki/ctc/ctc.pyx because it changed.
  ...
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2239, in resolve
    raise ImportError(str(exc))
  ImportError: module 'setuptools.dist' has no attribute 'check_specifier'
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-gx3opexs/

megalodon_helper.MegaError: Modified bases database already exists.

Hi, while running megalodon on the E. coli dataset from PRJEB22772 I'm getting the error below. I'm reporting it because the run will take ~5 days and I'm not sure whether it makes sense to keep it running. Any ideas?

$ megalodon PRJEB22772/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145/ --reference ref/mock_community.fa --output-directory megalodon/PRJEB22772/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145 --outputs basecalls mappings mods --devices 0 --verbose-read-progress 3

[17:36:25] Using canonical alphabet ACGT and modified bases Y=6mA Z=5mC.
[17:36:25] Loading reference.
[17:36:27] Preparing workers to process reads.
[17:36:27] Processing reads.
3 most common unsuccessful read types:
     -----
     -----
     -----
0read [00:00, ?read/s]
Process Process-5:
Traceback (most recent call last):
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/site-packages/megalodon/mods.py", line 135, in __init__
    '{} {}'.format(*ft) for ft in tbl.items()))))
sqlite3.OperationalError: database is locked

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lpryszcz/src/miniconda3/lib/python3.7/site-packages/megalodon/mods.py", line 671, in _get_mods_queue
3 most common unsuccessful read types:emory)
    36.9% (   4138 reads) : No alignment                                                                    
     1.7% (    193 reads) : Likely out of memory error.                                                     
     -----megalodon_helper.MegaError: Modified bases database already exists. Either provide location for new database or open in read_only mode.
  2%|██▋                                                                                                                                               | 11249/619525 [2:09:53<117:03:21,  1.44read/s, ksamp/s=167]

error on merging per read databases

Hello,

I am trying to merge several per_read_modified_base_calls.db files on a 500 GB RAM machine. It also has 88 GB of /tmp space available.

On the full dataset, which is ~600 *.db files each ~1.5 GB in size, I get the error below after over 100 GB of the merged db has been written. Then it deletes the merged db. It also gives me "Stale file handle". The same happens if I use the --mod-positions-on-disk option. I monitored the memory and /tmp usage; it doesn't use /tmp and memory usage was reasonable.

Traceback (most recent call last):
  File "<path_to_megalodon>/megalodon/scripts/merge_per_read_modified_base_dbs.py", line 60, in <module>
    main()
  File "<path_to_megalodon>/megalodon/scripts/merge_per_read_modified_base_dbs.py", line 54, in main
    out_mods_db.create_data_covering_index()
  File "~/tools/miniconda3/envs/megalodon_env/lib/python3.7/site-packages/megalodon/mods.py", line 373, in create_data_covering_index
    self.cur.execute('CREATE INDEX data_cov_idx ON data(' +
sqlite3.OperationalError: database or disk is full

Then I tried merging 10 of the *.db files, which worked fine. But when I tried merging ~200 *.db files, I got a different error:

Modified base per-read database file (done/per_read_modified_base_calls.db) does not exist.
Traceback (most recent call last):
  File "<path_to_megalodon>/megalodon/scripts/merge_per_read_modified_base_dbs.py", line 60, in <module>
    main()
  File "<path_to_megalodon>/megalodon/scripts/merge_per_read_modified_base_dbs.py", line 38, in main
    mod_index_in_memory=False, uuid_index_in_memory=False)
  File "~/tools/miniconda3/envs/megalodon_env/lib/python3.7/site-packages/megalodon/mods.py", line 105, in __init__
    raise mh.MegaError('Invalid mods DB filename.')
megalodon.megalodon_helper.MegaError: Invalid mods DB filename.

Do you have any idea what could be wrong ?

Cheers

Megalodon 2.2 Performance and Updated per_read_modified_base_calls database schema

Hello,

Thank you for this new release of Megalodon. I've been doing some testing and I have 2 issues with it so far:

  1. The read-processing performance went from ~28 reads/s (Megalodon 2.1 w/ Guppy 4.0.14) to ~9 reads/s (Megalodon 2.2 w/ Guppy 4.0.15), for the same outputs requested.
    I tried applying the recommended Guppy params --num_callers 5 --ipc_threads 6 and even increasing --gpu_runners_per_device, but performance was not better. GPUs are 2x GeForce RTX 2080 Ti.

  2. The per_read_modified_base_calls.db file doesn't contain the mod table anymore. Therefore, when calling multiple motifs (see the Megalodon command below), there is no way to know which motif was present at which position, and the motif-dependent strand offset cannot be applied in downstream analysis. A solution would be to run Megalodon once per motif and then add the offset when merging the outputs, but this would be quite inefficient... A new megalodon_extras script proposes to merge databases; is there one to split them?

Megalodon 2.2 Command:

megalodon ./final_fast5s/ --guppy-server-path ${GUPPY_DIR}/guppy_basecall_server \
        --guppy-params "-d ./rerio/basecall_models/ --num_callers 5 --ipc_threads 6" \
        --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
        --outputs basecalls mappings mods per_read_mods mod_mappings \
        --output-directory ./res_mega \
        --reference $genomeFile \
        --sort-mappings \
        --mod-motif Z GCG 1 --mod-motif Z HCG 1 --mod-motif Z GCH 1 \
        --write-mods-text \
        --mod-aggregate-method binary_threshold \
        --mod-binary-threshold 0.7 \
        --mod-output-formats bedmethyl wiggle \
        --mod-map-emulate-bisulfite \
        --mod-map-base-conv C T --mod-map-base-conv Z C \
        --devices 0 --processes 30 

Megalodon 2.1 Command:

megalodon ./final_fast5s/ --guppy-server-path ${GUPPY_DIR}/guppy_basecall_server \
        --guppy-params "-d ./rerio/basecall_models/" \
        --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
        --outputs basecalls mappings mods per_read_mods mod_mappings \
        --output-directory ./megalodon_results \
        --reference $genomeFile \
        --mod-motif Z GCG 1 --mod-motif Z HCG 1 --mod-motif Z GCH 1 \
        --write-mods-text \
        --mod-aggregate-method binary_threshold \
        --mod-binary-threshold 0.7 \
        --mod-output-formats bedmethyl wiggle \
        --mod-map-base-conv C T --mod-map-base-conv Z C \
        --devices 0 --processes 30

Performance drop during Read Processing, maxing out "output queue capacity basecalls" but not I/O

Hello,

I experienced a significant drop in read-processing performance (from ~22 reads/s to 10 reads/s and falling) with Megalodon after ~580,000 reads processed (~7 h), when the basecalls output queue capacity was maxed out (10000/10000). The output states that this is a sign of an I/O bottleneck, but monitoring tools like iostat and iotop showed the drive (2 TB NVMe) as fine and far from fully utilized. What could be going wrong?

Guppy Basecall Server
Guppy Basecall Service Software, (C) Oxford Nanopore Technologies, Limited. Version 4.0.14+8d3226e, client-server API version 2.1.0

Megalodon
Megalodon version: 2.1.1

Megalodon command

megalodon ./final_fast5s/ --guppy-server-path ${GUPPY_DIR}/guppy_basecall_server \
        --guppy-params "-d ./rerio/basecall_models/" \
        --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
        --outputs basecalls mod_basecalls mappings mods per_read_mods mod_mappings \
        --output-directory ./mega_results/ \
        --reference $genomeFile \
        --mod-motif Z GCG 1 --mod-motif Z HCG 1 --mod-motif Z GCH 1 \
        --write-mods-text \
        --mod-aggregate-method binary_threshold \
        --mod-binary-threshold 0.875 \
        --mod-output-formats bedmethyl wiggle \
        --mod-map-base-conv C T --mod-map-base-conv Z C \
        --devices 0 --processes 30

IOstat

Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
nvme0n1          87.25         0.73         4.88     439415    2923495

pytorch download url inaccessible

Receiving the following error when trying to install via the installation instructions. The previous step completed with no issues.

~/git/taiyaki$ make install 
rm -rf venv
virtualenv --python=python3 --prompt="(taiyaki) " venv
Running virtualenv with interpreter /home/grid/miniconda3/bin/python3
Using base prefix '/home/grid/miniconda3'
/usr/lib/python3/dist-packages/virtualenv.py:1086: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
New python executable in /home/grid/git/taiyaki/venv/bin/python3
Also creating executable in /home/grid/git/taiyaki/venv/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
source venv/bin/activate && \
    python3 venv/bin/pip install pip --upgrade && \
    mkdir -p /home/grid/.cache/taiyaki/wheelhouse/ && \
    python3 venv/bin/pip download --dest /home/grid/.cache/taiyaki/wheelhouse/ http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl && \
    python3 venv/bin/pip install --find-links /home/grid/.cache/taiyaki/wheelhouse/ --no-index torch && \
    python3 venv/bin/pip install -r requirements.txt  && \
    python3 venv/bin/pip install -r develop_requirements.txt && \
    python3 setup.py develop
Requirement already up-to-date: pip in ./venv/lib/python3.7/site-packages (19.2.3)
Collecting torch==1.2.0 from http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl
  ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl
  ERROR: Could not install requirement torch==1.2.0 from http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl
ERROR: Could not install requirement torch==1.2.0 from http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl for URL http://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl
make: *** [Makefile:48: install] Error 1
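A possible workaround, assuming the wheel is still hosted but only reachable over https (the Makefile's URL uses plain http, which is what returns the 403): install the wheel into the venv manually before re-running the remaining install steps.

source venv/bin/activate
# same wheel URL as in the Makefile, but over https (assumption: plain-http access was disabled)
pip install https://download.pytorch.org/whl/cpu/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl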

Installation Confusion

Hi dev team of Megalodon,

I am a bit confused as to what other components are needed for the most up-to-date Megalodon to function properly. My goal is to call 5mC in all sequence contexts. Currently our cluster has Guppy 4.0.11 installed. Should I install Rerio before installing Megalodon? Or does Megalodon already include the trained model in question? The documentation page did not explain this very clearly. Can you clarify?

A). Just install this and it will work
B). Need Guppy to be installed first
C). Need Rerio to be installed first

Which one is it?
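For what it's worth, here is the setup I would sketch based on the commands elsewhere on this page (assuming Guppy is already installed; the all-context 5mC model lives in Rerio, not in Megalodon itself):

# 1) Megalodon itself
pip install megalodon
# 2) clone Rerio and fetch the all-context model (download_model.py is Rerio's fetch script)
git clone https://github.com/nanoporetech/rerio
rerio/download_model.py rerio/basecall_models/res_dna_r941_min_modbases-all-context_v001
# 3) point the Guppy backend at the Rerio model directory at run time
megalodon raw_fast5s/ \
    --guppy-server-path <path/to/guppy_basecall_server> \
    --guppy-params "-d ./rerio/basecall_models/" \
    --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
    ...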

Guppy server initialization failed

Hi there,

I am trying to call mod bases 5mC and 5hmC and I am having trouble initializing Guppy.
I used the following command:
megalodon /home/fast5_pass --guppy-params "-d ./rerio/basecall_models/" --guppy-config res_dna_r941_min_modbases_5mC_5hmC_CpG_v001.cfg --guppy-server-path /home/hh/Software/ont-guppy-cpu/bin/guppy_basecall_server --mod-motif mh CG 0

And it fails to initiate guppy.

  File "/home/hh/miniconda3/lib/python3.7/site-packages/megalodon/backends.py", line 357, in start_guppy_server
    'Guppy server initialization failed. See guppy logs ' +
megalodon.megalodon_helper.MegaError: Guppy server initialization failed. See guppy logs in --output-directory for more details.

Unfortunately the error logs are not very helpful, only informing me that

[guppy/error] main: Layer type 'GlobalNormTwoStateCatMod' must be last layer.
[guppy/warning] main: An error occurred initialising the basecaller. Aborting.

Not sure what I am missing. I tried re-installing Guppy and Rerio but no change. Any help/info on calling 5mC and 5hmC would be amazing.

Guppy basecaller Version 4.0.14+8d3226e
Ubuntu 16.04

confidence of identified CpG methylation

Hi!

I was running megalodon v2.1.1 (installed via conda) on my R9.4 reads from one MinION run as well as one PromethION run (roughly 100x total coverage of my target genome).

I ran this command:

megalodon \
--guppy-config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
--guppy-server-path=/home/ubuntu/tools/ont-guppy/bin/guppy_basecall_server \
--devices 0 1 \
--processes $NUMTH \
--guppy-params "--num_callers 8 --ipc_threads 12" \
--mod-motif Z CG 0 \
--outputs "mods" \
--mod-output-formats "bedmethyl"  \
--mod-binary-threshold 0.5 \
--guppy-timeout 10.0 \
--output-directory ${REF%.fasta}_${SET1}_megalodon_5MC \
--reference $REF \
--overwrite \
$FAST5 

This resulted in ~99.1% of all CpG positions being called methylated when filtering for sites that occur in both runs (MinION & PromethION), even when being very stringent (--mod-binary-threshold 0.5).

This number seemed very high to me, so I decided to test Megalodon on a published Drosophila 1D PCR-free dataset (NCBI accession SRR8627923). Using the same settings as above resulted in 96% modified CpGs, a number far off the published methylation % for Drosophila melanogaster. Since the authors of the Drosophila study also performed a MinION run on PCR-amplified DNA, I was able to compare the signals from both runs. The PCR library, which should theoretically show <1% methylation, nevertheless showed 79% methylated CpG sites. The pure, noise-free signal accounts for roughly 17% methylated CpGs, a number closer to, but still far higher than, what has been published for other adult Drosophila samples.

I am now wondering:
(1) is the noise in nanopore methylation calls always so high that I HAVE TO run a PCR sample as a negative control?
(2) should I have run a different model for the calls? (note: I got the exact same methylated CpG % for Drosophila when using 'res_dna_r941_min_modbases-all-context_v001.cfg' provided with the latest Rerio repo)

Looking forward to some insights and advice on how to proceed with my own data.

Michael

using amplified cDNA reads as control for genomic reads?

Hi!

This is more of a consideration than an issue.
I am wondering if it would be possible to use nanopore cDNA reads and compare Megalodon's output with that from the genomic reads, to identify possible false positive bases (for coding regions). I do not have a set of amplified genomic reads to use for this purpose. Would the current Megalodon pipeline work with transcripts, or would an adjustment have to be made in the mapping step (the mappy part of the code)?

Best
Michael

Question about the setting of "--processes"

If I use a 48-core Xeon CPU with dual NVIDIA Tesla V100 GPUs and 500 GB RAM to process ONT FAST5 data with Megalodon, how many --processes should I set to use the full power of my machine and speed up the analysis?
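Not official guidance, but a sketch of how I would reason about it: the V100s (via --devices) do the basecalling through the Guppy server, while --processes sets the number of CPU worker processes doing mapping and per-read calling, so reserving a few of the 48 cores for the Guppy server threads and the OS seems sensible:

# e.g. both GPUs for basecalling, ~40 of the 48 cores for Megalodon workers
# (the exact split is an assumption; tune while watching GPU/CPU utilization)
megalodon raw_fast5s/ \
    --devices 0 1 \
    --processes 40 \
    ...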

Sequence summary table NA's

Hi there,

Thank you for this great tool!!

We are calling mod bases 5mC and 5hmC in CpG context and have had great success so far. However, our sequence summary text file is just a table of NAs. Not sure what is going on there. Command:
megalodon fast5/ --guppy-params "-d /home/hh/rerio/basecall_models" --guppy-config res_dna_r941_min_modbases_5mC_5hmC_CpG_v001.cfg --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --outputs basecalls mods per_read_mods mod_basecalls mappings mod_mappings --write-mods-text --mappings-format bam --reference consensus.fasta --mod-motif mh CG 0 --output-directory megalodon

No errors or anything. We get all of the output we are expecting but the sequence_summary.txt is a massive NA table like this:

filename	read_id	run_id	batch_id	channel	mux	start_time	duration	num_events	passes_filtering	template_start	num_events_template	template_duration	sequence_length_template	mean_qscore_template	strand_score_template	median_template	mad_template	scaling_median_template	scaling_mad_template
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

It goes on like that for ages.....

Ubuntu 16.04
Megalodon installed with conda

Thanks in advance for your help!

Experimental RNA branch install procedure fails/instructions lacking

As mentioned at the bottom of the readme, direct RNA support "can be accessed within the rna github branch (access via git clone --branch rna https://github.com/nanoporetech/megalodon)". However, there are no instructions for how to install the RNA branch, and running setup.py fails as described below.

After cloning the branch, installing into a conda env with Python 3.6 on Ubuntu 18.04.3 via

python setup.py build

fails with:

gcc -pthread -B /home/user/miniconda3/envs/megalodon_p36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/.local/lib/python3.6/site-packages/numpy/core/include -I/home/user/miniconda3/envs/megalodon_p36/include/python3.6m -c megalodon/_decode.c -o build/temp.linux-x86_64-3.6/megalodon/_decode.o -std=c99

The megalodon conda package was already installed with
conda install -c megalodon
as were the build prerequisites with
sudo apt install build-essential python3-dev libevent-dev

How many alignments from a read are used?

By default minimap2 can generate secondary as well as supplementary alignments for each read. Which alignments are used for variant and modification calling? I'm guessing only the primary?

As a related question, is there any way to alter the minimap2 mapping parameters within a megalodon run?

Thanks!
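For context, Megalodon maps reads via the mappy (minimap2) Python bindings; the snippet below is only an illustration of where the minimap2 parameters live in that API, not Megalodon's actual code:

import mappy

# minimap2 with the ONT preset; preset/extra options are fixed when the Aligner is built
aligner = mappy.Aligner("reference.fa", preset="map-ont")
for hit in aligner.map("ACGTACGT"):  # hypothetical read sequence
    # mappy flags each hit as primary or not, so secondary/supplementary hits can be filtered here
    print(hit.ctg, hit.r_st, hit.r_en, hit.strand, hit.is_primary)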

input of already guppy called fast5s?

Hi,
We are considering processing 166 PromethION runs with Megalodon, but we can basecall the raw data much more efficiently on our GPU cluster (where we cannot install the Guppy server). Is there a way to provide these already mod-called FAST5s to Megalodon?

Ploidy

Is Megalodon able to call variants/modifications at heterozygous sites?

Thx,
F

Add benchmark results

Add baseline benchmark results for megalodon modified base and variant detection performance.

megalodon --list-supported-guppy-configs doesn't return config list

Hello,

We have DNA R10.1 GridION microbial data for which we would like to use Megalodon to call 6mA and 5mC modified bases. However, it appears that the default guppy config file is dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg.

I am trying to check the other available config files using "megalodon --list-supported-guppy-configs". This command doesn't return a config file list for me; instead it returns the default argument list followed by "megalodon: error: the following arguments are required: fast5s_dir".

Any thoughts on if this is a bug, or if there is an installation problem?

Thanks,
Kim

Difficulties in installing Megalodon

When I type
pip install git+git://github.com/nanoporetech/megalodon.git

I receive the following error message.
Collecting git+git://github.com/nanoporetech/megalodon.git
Cloning git://github.com/nanoporetech/megalodon.git to /tmp/pip-req-build-ddvy0c6h
Running command git clone -q git://github.com/nanoporetech/megalodon.git /tmp/pip-req-build-ddvy0c6h
Requirement already satisfied: h5py>=2.2.1 in ./anaconda3/lib/python3.7/site-packages (from megalodon==0.1.0) (2.9.0)
Requirement already satisfied: numpy>=1.9.0 in ./anaconda3/lib/python3.7/site-packages (from megalodon==0.1.0) (1.16.4)
Requirement already satisfied: Cython>=0.25.2 in ./anaconda3/lib/python3.7/site-packages (from megalodon==0.1.0) (0.29.12)
Requirement already satisfied: mappy>=2.16 in ./anaconda3/lib/python3.7/site-packages (from megalodon==0.1.0) (2.17)
Collecting pysam>=0.15 (from megalodon==0.1.0)
Using cached https://files.pythonhosted.org/packages/15/e7/2dab8bb0ac739555e69586f1492f0ff6bc4a1f8312992a83001d3deb77ac/pysam-0.15.3.tar.gz
ERROR: Complete output from command python setup.py egg_info:
ERROR: # pysam: cython is available - using cythonize if necessary
# pysam: htslib mode is shared
# pysam: HTSLIB_CONFIGURE_OPTIONS=None
# pysam: (env) CC=/condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc
# pysam: (env) CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe
# pysam: (env) LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now
checking for gcc... /condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc accepts -g... yes
checking for /condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc option to accept ISO C89... none needed
checking for ranlib... /condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-ranlib
checking for grep that handles long lines and -e... /usr/bin/grep
checking for C compiler warning flags... -Wall
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking for _LARGEFILE_SOURCE value needed for large files... no
checking shared library type for unknown-Linux... plain .so
checking how to run the C preprocessor... /condo/ieg/nxiao6/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cpp
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for unistd.h... (cached) yes
checking for sys/param.h... yes
checking for getpagesize... yes
checking for working mmap... yes
checking for gmtime_r... yes
checking for fsync... yes
checking for drand48... yes
checking whether fdatasync is declared... yes
checking for fdatasync... yes
checking for library containing log... -lm
checking for zlib.h... no
checking for inflate in -lz... no
configure: error: zlib development files not found

HTSlib uses compression routines from the zlib library <http://zlib.net>.
Building HTSlib requires zlib development files to be installed on the build
machine; you may need to ensure a package such as zlib1g-dev (on Debian or
Ubuntu Linux) or zlib-devel (on RPM-based Linux distributions or Cygwin)
is installed.

FAILED.  This error must be resolved in order to build HTSlib successfully.
make: ./version.sh: Command not found
make: ./version.sh: Command not found
config.mk:2: *** Resolve configure error first.  Stop.
# pysam: htslib configure options: None
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-a6k09_wr/pysam/setup.py", line 241, in <module>
    htslib_make_options = run_make_print_config()
  File "/tmp/pip-install-a6k09_wr/pysam/setup.py", line 68, in run_make_print_config
    stdout = subprocess.check_output(["make", "-s", "print-config"])
  File "/condo/ieg/nxiao6/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "/condo/ieg/nxiao6/anaconda3/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['make', '-s', 'print-config']' returned non-zero exit status 2.
----------------------------------------

ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-a6k09_wr/pysam/

Seek suggestions on variant calling?

Hello, Megalodon 2.0.0 requires a variant file in VCF format (--variant-filename VARIANT_FILENAME) for variant calling.
Can you provide advice on how to prepare that variant file?
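A minimal sketch of one way to prepare it, assuming the candidate variants come from an existing callset or a population resource (file names are placeholders; as I understand it, Megalodon scores the candidates you give it rather than discovering variants de novo):

# sort, bgzip-compress and index the candidate set with bcftools/htslib tools
bcftools sort candidates.vcf -Oz -o variants.sorted.vcf.gz
tabix -p vcf variants.sorted.vcf.gz
# then supply it to megalodon
megalodon raw_fast5s/ \
    --outputs variants \
    --variant-filename variants.sorted.vcf.gz \
    --reference reference.fa \
    ...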

--mod-motif Z CG 0 produces incorrect logging and output files

When running megalodon with --mod-motif Z CG 0, the logs state

Using canonical alphabet ACGT and modified bases Y=6mA (alt to A); Z=5mC (alt to C).

and produces an empty modified_bases.6mA.bed file.

It would also be nice to clarify what the Z and 0 indicate as it's not apparent from the documentation.
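For anyone else landing here, my reading of the three values (worth confirming against the docs):

# --mod-motif <mod base code(s)> <sequence motif> <relative position>
#   Z  = single-letter code for 5mC in this model's alphabet (Y is the 6mA code in the same alphabet)
#   CG = motif, searched against the reference, that calls are restricted to
#   0  = 0-based offset of the modified base within the motif (here: the C of CG)
megalodon ... --mod-motif Z CG 0 ...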

RNA methylation config file

Hello, I want to identify RNA methylation, but the config file dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac_prom.cfg is for DNA rather than RNA basecalling. I tried rna_r9.4.1_70bps_fast.cfg, but it did not work because that model contains no methylation information. How should I set the --guppy-config option for the identification of RNA methylation? Thanks!

Degenerate bases and megalodon

Hello!

I've got an idea that I think would work, but I need your help.

https://www.biorxiv.org/content/10.1101/645903v1

I'm trying to replicate this paper's results using patterned UMIs. These UMIs are regions of my primers that have unknown sequence but known length (18 bp) and known restrictions on the possible sequences ("NNNYRNNNYRNNNYRNNN"). Each UMI is flanked by known 20 bp sequences, making the region easy to identify.

Accurate per-read basecalling of these regions is critical. It seems to me that we should be able to use the pattern to aid basecalling. (In other words, we should be able to answer the question 'Over a given series of observed events, what is the most likely sequence that fits the UMI pattern?')

Do you think megalodon could help here? I've tried making a vcf file of my UMI but I don't know if I did it appropriately for megalodon.

Best,
Joe
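As a side note on the combinatorics, here is a small Python sketch (my own, not Megalodon code) that expands the IUPAC pattern lazily; with 12 N and 6 Y/R positions the pattern admits 4^12 * 2^6 ≈ 1.07e9 candidate UMIs, which is why per-read decoding against the pattern, rather than enumerating every candidate into a VCF, seems like the right framing:

from itertools import product

# standard IUPAC degenerate codes used in the pattern
IUPAC = {"N": "ACGT", "Y": "CT", "R": "AG"}

def expand(pattern):
    """Lazily yield every concrete sequence matching an IUPAC pattern."""
    for combo in product(*(IUPAC.get(base, base) for base in pattern)):
        yield "".join(combo)

pattern = "NNNYRNNNYRNNNYRNNN"
n = 1
for base in pattern:
    n *= len(IUPAC.get(base, base))
print(n)  # 1073741824 candidate UMIs -> full enumeration is impractical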

Setting a Hard threshold for modified base aggregation

Hello,

I am trying to use the --mod-binary-threshold option in Megalodon's mod base aggregation.
It is described as (probability of modified/canonical base) and has a default value of 0.75. My main question is: how exactly is it calculated?

I would like to use other Megalodon outputs to calculate the threshold that best fits my data and my analysis; however, I haven't succeeded yet.

My process was, for two libraries (fully methylated and negative control) processed by Megalodon:

1. From per_read_modified_base_calls.txt, calculate per_read_scores = mod_log_probs - can_log_prob.
2. Set a range of thresholds and, for each one, calculate the methylation proportion (#methylated sites / #non-methylated sites) in each sample.
3. Get TP: true positives (the proportion of methylated sites in the fully methylated library).
4. Get FP: false positives (the proportion of methylated sites in the negative control).
5. Pick the threshold that maximizes TP - FP, and aggregate the real data with this threshold.

I tried this with both per_read_scores = mod_log_probs - can_log_prob and per_read_scores = exp(mod_log_probs - can_log_prob). For the log ratio I got a threshold of 0.89; for the exp(log ratio) I got a threshold of 2.42.

I found 0.89 still gave me too many FPs, while 2.42 didn't count any site as methylated.
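Below is a minimal Python sketch of how I understand binary_threshold aggregation to work (my reading of the help text, not the actual implementation, and the column layout of per_read_modified_base_calls.txt is assumed): per-read log probabilities are converted back to probabilities, a read is counted as modified if P(mod) >= threshold, as canonical if P(canonical) >= threshold, and is otherwise discarded. Note the threshold lives in probability space, so a cutoff derived from log ratios (like 0.89 above) is not directly comparable to the 0.75 default.

import math
from collections import defaultdict

THRESHOLD = 0.75  # probability-space threshold, as in --mod-binary-threshold

# assumed columns: read_id  chrm  strand  pos  mod_log_prob  can_log_prob  mod_base
counts = defaultdict(lambda: [0, 0])  # site -> [n_modified, n_canonical]
with open("per_read_modified_base_calls.txt") as fh:
    fh.readline()  # skip header
    for line in fh:
        read_id, chrm, strand, pos, mod_lp, can_lp, mod_base = line.rstrip("\n").split("\t")
        site = (chrm, strand, int(pos))
        if math.exp(float(mod_lp)) >= THRESHOLD:
            counts[site][0] += 1
        elif math.exp(float(can_lp)) >= THRESHOLD:
            counts[site][1] += 1
        # reads where neither probability clears the threshold are discarded

for (chrm, strand, pos), (n_mod, n_can) in counts.items():
    cov = n_mod + n_can
    if cov > 0:
        print(chrm, pos, strand, n_mod / cov, cov)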

How to get the bed file from per_read_modified_base_calls.db

Hi,
When I analyzed DNA methylation in human samples, I divided the original files into multiple parts to run Megalodon v1.0.2. After that I got multiple sets of:

log.txt
mappings.bam
modified_bases.5mC.bed
modified_bases.6mA.bed
per_read_modified_base_calls.db
per_read_modified_base_calls.txt

Then I used merge_per_read_modified_base_dbs.py to merge the per_read_modified_base_calls.db files, and got megalodon_merge_mods_results/per_read_modified_base_calls.db.

I ultimately want a merged bed file, but I don't know how to produce one. Can you give some advice on this?
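In case it helps: newer Megalodon versions (2.x) expose the aggregation step through the megalodon_extras entry point, which can re-run aggregation directly from a per-read calls database and write the bed output. The flags below are from memory and may differ by version, so check megalodon_extras aggregate run --help first:

# re-run modified base aggregation over the merged per-read database (flags assumed; see --help)
megalodon_extras aggregate run \
    --megalodon-directory megalodon_merge_mods_results \
    --outputs mods \
    --processes 8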
