novoalab / modphred

modPhred is a pipeline for detection of DNA/RNA modifications from raw ONT data

Home Page: https://modphred.readthedocs.io

License: MIT License

Jupyter Notebook 35.07% Python 63.73% Dockerfile 1.20%

basecalling dna fast5 guppy modifications nanopore on-the-fly rna

modphred's Issues

treat introns properly in mod_cluster.py --minAlgFrac

Modphred hanging

Hello, I run modphred with the test data in the tutorial as follows:

run -f ref/ECOLI.fa -o OUTPUT -i PRJEB22772/* -t4 \
-c dna_r10.3_450bps_fast.cfg \
--host /guppy/bin/guppy_basecall_server

However, even though the guppy_basecall_server is running, modphred seems to expect different output in the guppy log and just waits indefinitely.

The guppy log indicates the basecall server is running as:

Starting server on port: ipc:///tmp/07f7-506f-964b-2342

So my question is whether modphred is expecting the basecall server to open a TCP port instead of the above?

Thanks!
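One way to see which kind of address the server reported is to parse that log line directly. The following is a minimal sketch, not modPhred's own log handling: the function name and the "ipc"/"tcp" labels are hypothetical, and it assumes the log contains a line of the "Starting server on port: ..." form quoted above.

```python
import re
from typing import Optional, Tuple

# Matches the log line quoted in this issue, e.g.
# "Starting server on port: ipc:///tmp/07f7-506f-964b-2342"
_PORT_LINE = re.compile(r"Starting server on port:\s*(\S+)")


def parse_server_address(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (kind, address) found in a guppy server log, or None.

    The kind labels are a hypothetical classification for illustration:
    "ipc" for an ipc:// socket address, "tcp" for a bare port number.
    """
    match = _PORT_LINE.search(log_text)
    if match is None:
        return None
    address = match.group(1)
    if address.startswith("ipc://"):
        return ("ipc", address)
    if address.isdigit():
        return ("tcp", address)
    return ("unknown", address)
```

Feeding it the contents of the guppy log shows at a glance whether the server announced a socket file or a numeric port.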

modphred and RNA004

Dear all,

will modphred work with RNA004-data basecalled with Dorado to detect RNA m6A?

Thanks and best
Matthias

clustering

Hi. Using a container, we have your program running with the provided test dataset. It's beautiful!

My question is about how best to get the clustering working. I have only been doing bioinformatics for a year, so I apologize for the naive question. How should I specify a short region to run the clustering on? Is it passed to the mod_cluster.py command? Is it specified through the reference genome and the corresponding index? Or is there some other way?

thanks for the help.

/opt/modPhred/run

Hi,
this is just a suggestion: add /opt/modPhred/run to the $PATH within the docker image, so that users can call it directly from the image without needing to know this internal path.

Luca

how to specify input for read_cluster.py?

Hello,
I want to run read clustering on different samples separately (2 CTR and 2 KO), which were originally processed together so I have a common mod.gz for the 4 samples.
I tried to input a "subset" mod.gz with only the columns for CTR samples but the output still contains the reads from the KO samples... how can I specify what input samples I want to run the clustering on?

Hard to replicate with new models

First off, great work! It's always really nice when things work straight out of the box, and it really shows how much care you put into it if a beginner like me can get it to work. It also tackles a really important need in the nanopore methylation calling repertoire.

The model you use, the dam-dcm-specific one, was removed in guppy 4.5.2, and the currently available models struggle to replicate your test data results. Since the ONT software download page forces you to download the latest guppy, newer users won't be able to use the same basecaller. This is not really an issue with your package, but I wasn't sure where else to post it, because this solution was harder to come by than it should have been. For people looking for older Linux versions: just change the version number in the URL to whichever version you want to install. 4.4.2 is the latest release with the dam-dcm model still available:
wget https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_4.4.2_linux64.tar.gz
With the dam-dcm model I was once again able to replicate your test data.
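The "change the version number" step can be sketched as a small helper. This is only an illustration: the function name is hypothetical, the URL pattern is copied from the wget command above, and whether any version other than 4.4.2 is actually hosted at that location has not been checked.

```python
def guppy_archive_url(version: str) -> str:
    """Build the Linux archive URL for a given guppy version.

    The pattern is taken from the wget command in this issue; only the
    version string varies. Availability of other versions is an assumption.
    """
    return (
        "https://mirror.oxfordnanoportal.com/software/analysis/"
        f"ont-guppy_{version}_linux64.tar.gz"
    )


if __name__ == "__main__":
    # Prints the URL for the version named in this issue.
    print(guppy_archive_url("4.4.2"))
```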

Currently available models in guppy 4.5.2 and rerio:
Guppy: dna_r9.4.1_450bps_modbases_5mc_hac.cfg causes most of the 5mC to be CpGs even in bacteria... either we stumbled onto some amazing bacterial biology or it's just not working. (NB this 5mC model is the same as the 5mC one in rerio)
Rerio: res_dna_r941_min_modbases-all-context_v001.cfg gets a lot closer but below you can see that the methylations are more spread and less accurate.

EDIT: Bottom panel is with the rerio model, top panel is with your pre-basecalled dataset (sorry got that switched up pre-edit)

Got to be honest, it's a little worrisome how much difference the basecalling model can make. Definitely something I need to spend more time looking into. But at least this is a quick fix for other people who have the latest version of guppy and were struggling like me.

And thanks again for putting this package together.

Using modPhred on DRS data

Hi,

I think your scripts are great and it is cool that you provided a pretrained modification-aware model. I would like to ask some questions regarding modPhred.

I've encountered a few issues while testing your pipeline on my DRS data. Could not solve them by myself, so here I am.

First of all, I cloned the repo and installed all dependencies according to your instructions. Yet I was unable to run the pipeline (all your scripts at once) with on-the-fly basecalling, neither with my own data nor with the test dataset referenced in the docs. The problem is the "~/src/modPhred/run" part of the command: I get an error message saying that there is no such file or directory. And, actually, there is no file called run or run.py within the repo. I checked multiple times. I ran the command from bash and from python, and nothing worked. Am I missing something?

(my bash version: 5.0.17; my python version: 3.8.3, pyguppyclient version:0.0.6)

Nevertheless, I was able to run the scripts one by one on your pre-basecalled test data (~/PRJEB22772/MARC_ZFscreens_R9.4_1D-Ecoli) and obtained proper output.

So I decided to basecall my own RNA data with guppy 3.5.1 and your pretrained model (rna_r9.4.1_70bps_m6A-m5C-5hmC_hac.cfg). Since I could not do it on the fly, I basecalled my dataset standalone, with --trim_strategy none --reverse_sequence on --u_substitution on --fast5_out. Then I used the output as input for modPhred.

Unfortunately, guppy_align and mod_report do not work as expected. The resulting BAM file contains the expected total number of reads, but according to samtools flagstat and IGV, none of them is mapped to the reference genome. (I also basecalled the same sample with the standard, modification-unaware ONT model, and the majority of those reads were mapped.) In fact, I have tried multiple samples from various organisms and the result was always the same: every read in the BAM file was reported as unmapped, the resulting mod.gz and BED file were empty, and no plot could be produced because the script was unable to map the reads.

Could you provide an example of usage on RNA data and/or the guppy command line? I would appreciate that very much.

Best,
N.

Silently failing to map reads

Hi,

When I run modphred on two pre-basecalled samples, it gets as far as aligning and then fails silently: it tries to continue to samtools sorting, but no BAM exists.

When I look in the minimap2 workspace.log I can see only two lines:

[M::mm_idx_gen::49.955*1.50] collected minimizers
[M::mm_idx_gen::56.698*3.16] sorted minimizers

Running guppy_align.py by itself, I see similar behaviour: minimap2 processes keep running even after the guppy_align.py script has terminated, with no output other than the log lines above.

Is there an easy way of manually aligning and sorting with samtools to move on with the later steps of the pipeline?
I've run the samples individually with no issue but would like to compare and contrast them. Is there a way I can concatenate the individual runs together to do some visualisations?

Thanks for any help!
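For the manual route asked about above, the generic minimap2 and samtools steps can be sketched as follows. This is a sketch under stated assumptions, not what guppy_align.py does internally: the helper name and file names are hypothetical, the map-ont preset assumes DNA reads, and whether the later modPhred steps accept a BAM produced this way is not confirmed by anything in this thread.

```python
import shlex
from typing import List


def manual_alignment_commands(
    ref: str, fastq: str, bam: str, threads: int = 4
) -> List[str]:
    """Build shell commands that align reads and write a sorted, indexed BAM.

    Hypothetical helper: it only builds the command strings, it does not
    run them. Uses the standard minimap2 ONT preset and samtools sort/index.
    """
    q = shlex.quote  # protect paths that contain spaces or shell characters
    align_and_sort = (
        f"minimap2 -ax map-ont -t {threads} {q(ref)} {q(fastq)}"
        f" | samtools sort -@ {threads} -o {q(bam)} -"
    )
    index = f"samtools index {q(bam)}"
    return [align_and_sort, index]


if __name__ == "__main__":
    # Placeholder file names; substitute your own reference and reads.
    for command in manual_alignment_commands("ref.fa", "reads.fq", "out.bam"):
        print(command)
```

Printing the commands first, rather than executing them, makes it easy to inspect the exact invocation and to watch minimap2's own stderr when running it by hand.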

package required for mod_cluster.py not included in the image

Even though I used the singularity image I got this error when running mod_cluster.py:

Traceback (most recent call last):
  File "/users/enovoa/scruciani/soft/modPhred/src/mod_cluster.py", line 21, in <module>
    from sklearn.decomposition import PCA
ModuleNotFoundError: No module named 'sklearn'

Should scikit-learn be added to the image?
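A quick way to confirm which imports are missing inside an image, before running the script, is to ask Python whether each top-level module can be found. A minimal sketch follows; the helper name is hypothetical, and only sklearn (the import name of scikit-learn) is taken from the traceback above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found.

    Uses importlib.util.find_spec, which locates a module without importing
    it. Pass top-level names only (e.g. "sklearn", not "sklearn.decomposition").
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means every listed module is available.
    print(missing_modules(["sklearn"]))
```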

Connection error

Hello,

We are getting an error on Linux when running your test example with the guppy basecaller. The test example using pre-basecalled data works fine, but when running the second example:

run -f ref/ECOLI.fa -o PRJEB22772 -i PRJEB22772/* -t4 \
--host /guppy/4.2.2/bin/guppy_basecall_server

we get the following error:

ConnectionError: Connect with 'dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac' failed: [bad_reply] Could not interpret message from server for request: LOAD_CONFIG. Reply: INVALID_PROTOCOL
[2022-11-10 05:37:06.553666] [0x00002aaab5e8e700] [info] Connection error. [bad_reply] Could not interpret message from server for request: LOAD_CONFIG. Reply: INVALID_PROTOCOL

Could you please provide some guidance on this error?

Thanks!

KeyError: 'MaxPhredProb' in mods_from_bams.py script

Dear developers,
I'm trying to run the mods_from_bams.py script.
I run the following command:
(modPhred) diaz@Uwe2:/media/data/nuria/bcn_nanopore/files$ ~/src/modPhred/src/mods_from_bams.py -i sample1.bam sample2.bam -f GRCm39.genome.fa -o ./sample1_vs_sample2

In the "modPhred" conda environment I've all the dependencies described in your documentation.

However I get the following error:

[2023-08-01 17:47:04] ===== Welcome, welcome to modPhred pipeline! =====
Traceback (most recent call last):
  File "/home/diaz/src/modPhred/src/mods_from_bams.py", line 86, in <module>
    main()
  File "/home/diaz/src/modPhred/src/mods_from_bams.py", line 59, in main
    MaxPhredProb = data["MaxPhredProb"]
KeyError: 'MaxPhredProb'

Could you be so kind as to give me a hint on how to run the code?
Thanks in advance and best regards,
Núria
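The traceback only shows that the mapping called data has no "MaxPhredProb" entry; where data is loaded from is not visible here. As a debugging aid, a lookup that lists the keys that are present can narrow this down. The sketch below is a hypothetical helper, not part of modPhred.

```python
from typing import Any, Mapping


def require_key(data: Mapping[str, Any], key: str, source: str = "input") -> Any:
    """Return data[key], or raise a KeyError that lists the available keys.

    Hypothetical helper for debugging; `source` is just a label for the
    error message (for example, the file the mapping was read from).
    """
    if key not in data:
        available = ", ".join(sorted(data)) or "(none)"
        raise KeyError(f"{key!r} missing from {source}; available keys: {available}")
    return data[key]
```

Seeing which keys are actually present would indicate whether the inputs carry the metadata the script expects.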

Unable to find executables/scripts after conda installation

Hi everyone,

I just installed modPhred via conda, and would now like to run the test data.
I would expect that I have the relevant scripts like modEncode or modAlign in the path... but I do not.
So... how do I run them/where are they?

Thanks,
Bastian
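To check what is actually visible on the PATH of the active environment, each expected name can be looked up with shutil.which. This is a minimal sketch with a hypothetical helper name; the script names in the example (run, guppy_align.py, mod_report.py) are taken from other issues on this page, and whether a conda installation exposes them at all is not confirmed here.

```python
import shutil
from typing import Dict, Iterable, Optional


def locate_on_path(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_on_path(["run", "guppy_align.py", "mod_report.py"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```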