weir12 / dena

Deep learning model used to detect RNA m6a with read level based on the Nanopore direct RNA data.

License: MIT License

Shell 4.15% R 0.85% Python 95.00%

nanopore

dena's Issues

Problem with LSTM_extract.py I

Hello,
I am trying to use DENA. In step one, with LSTM_extract.py, I get the following error and I don't know why. I am using the command line from your tutorial, adapted to my data:

Traceback (most recent call last):
  File "/home/mario/Programas/DENA/DENA-release/step4_predict/LSTM_extract.py", line 21, in <module>
    from tombo import tombo_helper, tombo_stats, resquiggle
  File "/home/mario/anaconda3/lib/python3.11/site-packages/tombo/tombo_stats.py", line 84, in <module>
    HALF_NORM_EXPECTED_VAL = stats.halfnorm.expect()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/stats/_distn_infrastructure.py", line 2914, in expect
    dub = integrate.quad(fun, d, ub, **kwds)[0]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/integrate/_quadpack_py.py", line 465, in quad
    retval = _quad(func, a, b, args, full_output, epsabs, epsrel, limit,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/integrate/_quadpack_py.py", line 579, in _quad
    return _quadpack._qagie(func,bound,infbounds,args,full_output,epsabs,epsrel,limit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/stats/_distn_infrastructure.py", line 2891, in fun
    return x * self.pdf(x, *args, **lockwds)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/stats/_distn_infrastructure.py", line 1992, in pdf
    place(output, cond, self._pdf(*goodargs) / scale)
                        ^^^^^^^^^^^^^^^^^^^^
  File "/home/mario/anaconda3/lib/python3.11/site-packages/scipy/stats/_continuous_distns.py", line 4253, in _pdf
    return np.sqrt(2.0/np.pi)*np.exp(-x*x/2.0)
                              ^^^^^^^^^^^^^^^^
FloatingPointError: underflow encountered in exp
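
One possible reading of this traceback: NumPy ignores floating-point underflow by default, so a FloatingPointError from np.exp suggests that something imported earlier has switched NumPy's error handling to raise (for example via np.seterr). The traceback does not show which module does that, so treat this as an assumption. The sketch below only reproduces the mechanism with plain NumPy; the helper names are hypothetical and not part of DENA or Tombo.

```python
import numpy as np


def exp_underflows(x: float) -> bool:
    """Return True if np.exp(x) raises when underflow is set to 'raise'."""
    with np.errstate(under="raise"):
        try:
            np.exp(np.array([x], dtype=np.float64))
        except FloatingPointError:
            return True
    return False


def exp_ignoring_underflow(x: float) -> float:
    """Evaluate np.exp(x) with underflow silently flushed to zero."""
    with np.errstate(under="ignore"):
        return float(np.exp(np.array([x], dtype=np.float64))[0])


if __name__ == "__main__":
    print(exp_underflows(-1000.0))          # underflow is raised as an error
    print(exp_ignoring_underflow(-1000.0))  # same call, error state relaxed
```

If this is indeed the cause, the usual options are relaxing the underflow setting around the failing call or using a SciPy/NumPy combination where the import succeeds; which of these suits Tombo is not something this page establishes.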

"UnboundLocalError: local variable 'f4' referenced before assignment" in LSTM_predict.py

Thanks for the quick response to my previous question, and sorry for the late reply due to Chinese New Year.

I successfully downloaded the trained model, but when I ran LSTM_predict.py, I received the following message:

(DENA) ycc@A326:~$ python DENA/step4_predict/LSTM_predict.py -i /mnt/858ed1f5-8772-42cf-9d70-f9dc3b880628/WT_Dark_R1/ \
>                                                    -m ~/DENA/DENA-lstm/ \
>                                                    -o /mnt/858ed1f5-8772-42cf-9d70-f9dc3b880628/WT_Dark_R1/ \
>                                                    -p WT_Dark_R1_predict
[PosixPath('/home/ycc/DENA/DENA-lstm/AGACA/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/AGACC/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GGACT/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/AGACT/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/AAACC/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GAACC/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GGACA/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GGACC/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GAACA/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/AAACA/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/AAACT/model_best.pth'), PosixPath('/home/ycc/DENA/DENA-lstm/GAACT/model_best.pth')]
We found 12 well-trained model file
Traceback (most recent call last):
  File "DENA/step4_predict/LSTM_predict.py", line 118, in <module>
    main()
  File "DENA/step4_predict/LSTM_predict.py", line 110, in main
    if f4:
UnboundLocalError: local variable 'f4' referenced before assignment

Best regards
YCCHEN
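
The error pattern itself is generic Python: a local name that is assigned only inside a conditional branch stays unbound if that branch never runs. The sketch below is not DENA's code; the function names and the flag want_details are hypothetical, and the name f4 is reused only to mirror the message above. It shows the failure and the usual fix of initialising the name before the branch.

```python
import io


def summarise_buggy(want_details: bool) -> str:
    if want_details:
        f4 = io.StringIO()   # f4 is bound only on this branch
    if f4:                   # UnboundLocalError when want_details is False
        f4.write("details")
        return f4.getvalue()
    return "summary only"


def summarise_fixed(want_details: bool) -> str:
    f4 = None                # always bound, whatever the flag says
    if want_details:
        f4 = io.StringIO()
    if f4 is not None:
        f4.write("details")
        return f4.getvalue()
    return "summary only"


if __name__ == "__main__":
    print(summarise_fixed(False))
    try:
        summarise_buggy(False)
    except UnboundLocalError as err:
        print("buggy version failed:", err)
```

The command in the report was run without -d, while other reports on this page pass -d; whether that option controls the branch that binds f4 in LSTM_predict.py is a guess, not something confirmed here.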

argparse problem

Hi, I'm trying to use DENA to predict m6A in my data, but I encountered a problem at the first step.

python3 ./DENA-public/step4_predict/LSTM_extract.py get_pos --fasta ~/Desktop/Genome/Arabidopsis/cDNA/transcripts.fa LSTM_ --motif 'RRACH' LSTM_ --output /mnt/858ed1f5-8772-42cf-9d70-f9dc3b880628/candidate_predict_pos.txt

The error message:
Usage: LSTM_extract.py [-h] {get_pos,predict} ...
LSTM_extract.py: Error: unrecognized arguments: LSTM_ LSTM_

It seems to be an argparse problem. How can I resolve it? Thanks!
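
The usage line suggests a parser with two subcommands, and argparse reports any token it cannot assign to an option as an unrecognized argument. The sketch below rebuilds a hypothetical parser that reuses the subcommand and option names from the command above (the real LSTM_extract.py may define them differently) and calls parse_known_args to show which tokens are left over.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser; names copied from the report.
    parser = argparse.ArgumentParser(prog="LSTM_extract.py")
    sub = parser.add_subparsers(dest="command", required=True)
    get_pos = sub.add_parser("get_pos")
    get_pos.add_argument("--fasta")
    get_pos.add_argument("--motif")
    get_pos.add_argument("--output")
    sub.add_parser("predict")
    return parser


# The reported command, with the two stray "LSTM_" tokens kept in place.
argv = [
    "get_pos",
    "--fasta", "transcripts.fa", "LSTM_",
    "--motif", "RRACH", "LSTM_",
    "--output", "candidate_predict_pos.txt",
]

args, leftover = build_parser().parse_known_args(argv)
print("parsed:", vars(args))
print("left over:", leftover)
```

Under this sketch, dropping the two stray LSTM_ tokens from the command leaves nothing unrecognized.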

Issues with DENA's output

1. Some tmp files are empty during feature extraction.

2. The total reads in the final result table are low. Were some reads filtered out?

DENA model downloading problem

Hi,
I tried to download the pre-trained models via git lfs, but it displayed the following messages:

(DENA) ycc@A326:~$ git clone https://github.com/weir12/DENA.git
Cloning into 'DENA'...
remote: Enumerating objects: 599, done.
remote: Counting objects: 100% (599/599), done.
remote: Compressing objects: 100% (443/443), done.
remote: Total 599 (delta 226), reused 336 (delta 103), pack-reused 0
Receiving objects: 100% (599/599), 330.51 MiB | 13.62 MiB/s, done.
Resolving deltas: 100% (226/226), done.
Downloading DENA_model_lib/lstm_212/AAACA/model_best.pth (68 MB)
Error downloading object: DENA_model_lib/lstm_212/AAACA/model_best.pth (56f64df): Smudge error: Error downloading DENA_model_lib/lstm_212/AAACA/model_best.pth (56f64df07d383a2f554c0775e4b74d880d13cbd43aedd8879237431f255d02e9): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /home/ycc/DENA/.git/lfs/logs/20220127T221723.082502684.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: DENA_model_lib/lstm_212/AAACA/model_best.pth: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Is there any other way to obtain the models?

Willing to provide more details if needed, many thanks!

Best regards
YCCHEN

No modification value in the output data

Hi, thank you for developing a useful tool. I have run DENA on my dataset. The output files were generated successfully, but they contain no modification values. I am not sure what the problem is. Could you please give me some advice on how to fix it?

wt.tsv

ENST00000377350 2320    AAACA   0       1       0.0
ENST00000380668 473     AAACA   0       8       0.0
ENST00000380668 818     AAACA   0       8       0.0
ENST00000380668 846     AAACA   0       8       0.0
ENST00000398491 482     AAACA   0       1       0.0
ENST00000398491 827     AAACA   0       1       0.0
ENST00000398491 855     AAACA   0       1       0.0
ENST00000401827 401     AAACA   0       2       0.0
ENST00000267163 154     AAACA   0       6       0.0
ENST00000267163 368     AAACA   0       6       0.0
ENST00000267163 721     AAACA   0       7       0.0

wt_details.tsv

ENST00000277900 904     AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000277900 1309    AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000277900 1484    AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000277900 1552    AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000277900 1716    AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000277900 1874    AAACA
dce44906-2feb-47e7-a795-a05b923055a2    
ENST00000424679 908     AAACA
8c186437-137f-435b-a26f-24c35e5ce9f9    
ENST00000424679 974     AAACA
8c186437-137f-435b-a26f-24c35e5ce9f9    
ENST00000424679 1068    AAACA
8c186437-137f-435b-a26f-24c35e5ce9f9

The following is my command line. Note that I did not add -corr_grp ${RawGenomeCorrected_000} in the feature extraction step: my Tombo re-squiggling run with --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_001 --include-event-stdev --overwrite --ignore-read-locks failed, so I ran re-squiggle without those options and ran DENA as follows. I am wondering whether that could have caused the problem.

#Extract features
python3 /home/euijin.kwon-umw/Euijin/DENA/step4_predict/LSTM_extract.py predict --fast5 ${wt_fast5} --bam ${wt_bam} --processes 16 --sites ${candidate_predict_pos} --label wt --windows 2 2

#Predict
python /home/euijin.kwon-umw/Euijin/DENA/step4_predict/LSTM_predict.py -i . -m ${DENA} -o output -p wt -d

I am looking forward to hearing from you. Thank you!

“LSTM_extract.py” problem

Hi,
I tried to run this command in step 3:
python3 LSTM_extract.py --fast5 ${fast5_fn} --corr_grp ${RawGenomeCorrected_000} --bam ${bam_fn} --sites ${candidate_predict_pos.txt} --label ${any meaningful string} --windows 2 2

I used the command: python3 LSTM_extract.py --fast5 con-output/workspace/0/ --corr_grp ./con-output/RawGenomeCorrected_001 --bam ./con-basecalls.bam --sites ./candidate_predict_pos.txt --label LSTM-control --windows 2 2

But I received the following message:
Warning: The BRI module could not be loaded
Usage: LSTM_extract.py [-h] {get_pos,predict} ...
LSTM_extract.py: Error: argument command: invalid choice: 'con-output/workspace' (choose from 'get_pos', 'predict')

How could I resolve it? Thanks!
ZFZHANG
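
The message indicates that the first positional token was taken as the subcommand. A minimal sketch with a hypothetical two-subcommand parser (option names copied from the command above, not from DENA's source) shows that the same arguments are accepted once predict comes first.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser; names copied from the report.
    parser = argparse.ArgumentParser(prog="LSTM_extract.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("get_pos")
    predict = sub.add_parser("predict")
    for flag in ("--fast5", "--corr_grp", "--bam", "--sites", "--label"):
        predict.add_argument(flag)
    predict.add_argument("--windows", nargs=2, type=int)
    return parser


def parses(argv) -> bool:
    """Return True if argparse accepts argv, False if it exits with an error."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True


options = [
    "--fast5", "con-output/workspace/0/",
    "--corr_grp", "./con-output/RawGenomeCorrected_001",
    "--bam", "./con-basecalls.bam",
    "--sites", "./candidate_predict_pos.txt",
    "--label", "LSTM-control",
    "--windows", "2", "2",
]

print("without subcommand:", parses(options))
print("with 'predict' first:", parses(["predict"] + options))
```

Other reports on this page pass a group name such as RawGenomeCorrected_001 to --corr_grp rather than a path; whether a path is also accepted is not established here.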

Fail with step3 LSTM_extract.py

Dear DENA maker(s),
I have trouble with the third part of the pipeline (the LSTM_extract.py step).
After running it, every process produced an empty temporary file (0_tmp to 109_tmp) and there was no final output.
My command is:

python3 LSTM_extract.py predict --fast5 /fast5_workspace/ --corr_grp RawGenomeCorrected_001 --bam /basecalls.bam --sites candidate_predict_pos.txt --label test --windows 2 2 --processes 110

Do you have any idea what is causing this?

Thanks for the help

The log says
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
[17:49:29] Parsing Tombo index file(s).
... (lot of parsing)

processes_99: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60160/60160 [00:01<00:00, 35667.92it/s]

... (lot of bar)

None
None
None
None
None
None
None
None

... (lot of None)

Missing path model files

Hi,
I tried to use your software to perform some analysis and followed all the steps up to step 4 (prediction), but on the GitHub page I can't find the model files (.dat or .pkl) needed for the -m argument. Could you please provide them? Thanks in advance.

Fabio

The scope of application of the DENA

Hi developer,
Thank you for your tool; it's very nice. I have finished running my programs on human tissues, but I have a question.
Could DENA be used to analyze human tissues?
Thanks!

Best wishes,
zhang zaifeng

The common m6A sites

Hi,
As shown in the paper, 3106 common m6A sites were identified by the differr tool across Cm, Cf, and Vv. Could this site list be shared, please? Is there any way to obtain it? Thanks in advance.
xu

Output explanation of LSTM_predict.py

Hi,
Thanks for the great work you've done! I am using this software and got two tsv files after running LSTM_predict.py.

As shown on the main page, the tsv files contain the transcript ID and m6A sites, but what do the remaining columns mean? I would appreciate a more detailed explanation of the output files.

Thanks.

about m6A validation set

Hi DENA team! I'd like to know how you obtained the 4685 miCLIP sites identified in "Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification", since I found that the number of miCLIP sites identified in that paper was 93046.

DENA output from human data

Hi,

Thank you for providing a useful tool.

I found that you have uploaded the m6A output for A. thaliana in the supplementary data. However, I couldn't find the m6A output for the human data (HEK293T replicates 1 and 3). Could you please share the output data?
Thank you so much!

empty tmp file of LSTM_extract.py

Hi,
thanks for developing DENA. I ran the LSTM_extract.py predict step to extract features from my data using 72 CPUs and got 72 tmp files, but most of them were empty, and I obtained features for only 24 reads.

My command was:
python LSTM_extract.py predict \
    --processes 20 --fast5 fast5/ \
    --corr_grp RawGenomeCorrected_001 --bam sort.bam \
    --sites ../filt_candidate_predict_pos.txt --label test --windows 2 2

the log file said:
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
[20:45:17] Parsing Tombo index file(s).
processes_30: 100%|██████████| 3358/3358 [7:49:22<00:00, 8.39s/it]
processes_0: 100%|██████████| 3358/3358 [7:42:10<00:00, 1.49it/s]
processes_1: 100%|██████████| 3358/3358 [4:26:06<00:00, 4.56s/it]
processes_2: 100%|██████████| 3358/3358 [4:34:13<00:00, 1.99it/s]
(more hidden)

Do you have any idea? Thanks for your help.

LSTM_extract.py predict output nothing

Dear professor,
When I use the command:
python ./DENA-release/step4_predict/LSTM_extract.py predict --fast5 ./singleFast5/ --corr_grp RawGenomeCorrected_000 --bam ./col.sorted.bam --sites ../candidate_predict_pos.txt --label col --windows 2 2 --processes 20

I get some tmp files, but 3_tmp has nothing.
625867 0_tmp
709963 10_tmp
678042 11_tmp
73825 12_tmp
701523 14_tmp
682437 15_tmp
691647 17_tmp
663647 18_tmp
651893 1_tmp
0 3_tmp
559607 4_tmp
612549 5_tmp
757275 6_tmp
706120 7_tmp
151 8_tmp
522064 9_tmp

And this causes the process to go to sleep.
415378 ? S 0:00 python

What can I do about this problem?

Best wishes.
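
A quick way to see which workers wrote nothing is to list the zero-byte *_tmp files. The sketch below assumes the temporary files sit in one directory and follow the <n>_tmp naming shown above; empty_tmp_files is a hypothetical helper, not part of DENA.

```python
from pathlib import Path
from typing import List, Tuple


def _worker_key(name: str) -> Tuple[int, str]:
    """Sort '3_tmp' before '10_tmp'; names without a numeric prefix go first."""
    head = name.split("_", 1)[0]
    return (int(head) if head.isdigit() else -1, name)


def empty_tmp_files(directory: str) -> List[str]:
    """Return the names of zero-byte '*_tmp' files in the given directory."""
    empty = [
        p.name
        for p in Path(directory).glob("*_tmp")
        if p.is_file() and p.stat().st_size == 0
    ]
    return sorted(empty, key=_worker_key)


if __name__ == "__main__":
    # Lists empty temporary files in the current working directory.
    print(empty_tmp_files("."))
```

Comparing the returned names with the number of processes requested also shows whether some workers never created a file at all, as in the listing above where indices 2, 13, and 16 are absent.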

Missing import pandas as pd in LSTM_predict.py

Hi,
I tried to run this command:

python LSTM_predict.py -i ${path_features} -m ${path_models} -o ${path_output} -p ${prefix_outfile} -d

The -d option requires creating a pandas data frame, but LSTM_predict.py does not seem to import pandas; I had to add "import pandas as pd" to avoid the error. Just to let you know.

Fabio

LSTM_extract.py predict 0 reads in bamfile

Dear developers,
I'm running DENA on a dataset, and I am not able to get any results. In particular, I noticed that this command:
python3 /DENA/step4_predict/LSTM_extract.py predict --fast5 FAST5_dir --corr_grp RawGenomeCorrected_000 --bam transcriptome.bam --sites candidate_predict_pos.txt --label "dena_label" --windows 2 2 --processes 1 --debug
is producing only *_tmp empty files.
I noticed from the standard error that the tool is able to find reads in fast5 but not in bam file, e.g.:

transcript1	pos1-pos2found x reads in fast5
transcript1	pos1-pos2found 0 reads in bamfile

Do you know what may have caused the issue?
Thanks in advance,
Simone
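
One thing worth ruling out (an assumption, not a confirmed cause) is a naming mismatch: if the transcript names in the sites file differ from the reference names in the BAM header, for instance by a version suffix, a lookup by name finds no reads. The sketch below compares two plain collections of names; how they are obtained (for example from the output of samtools view -H) is left open, and both helpers are hypothetical.

```python
def strip_version(name: str) -> str:
    """Drop a trailing '.<digits>' version suffix, e.g. 'ENST00000000233.9'."""
    head, dot, tail = name.rpartition(".")
    return head if dot and tail.isdigit() else name


def name_overlap(site_names, bam_names) -> dict:
    """Count names shared exactly and names shared once versions are ignored."""
    sites, bam = set(site_names), set(bam_names)
    exact = sites & bam
    loose = {strip_version(n) for n in sites} & {strip_version(n) for n in bam}
    return {"exact": len(exact), "ignoring_version": len(loose)}


if __name__ == "__main__":
    # Illustrative names only; substitute the names from your own files.
    print(name_overlap(["ENST00000000233"], ["ENST00000000233.9"]))
```

An exact count of zero together with a non-zero loose count would point at the naming scheme rather than at the reads themselves.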

Regarding the output of LSTM_predict.py

Hi,
I have used the DENA tools to detect RNA m6A in my data, and the "-d" parameter was added when I ran LSTM_predict.py. I got result ${prefix_outfile}_details.tsv like this:

ENST00000000233.9 317 AAACA
df2e44fd-337f-4d04-bcd6-95d9ce62e6ef 1.4540097
21a9b4c5-8ab5-42c2-a561-4aa9fffc6688 -2.8870084
dd4e2a5d-dfbe-45fd-bd18-c4b15ae12bae -0.56397027
a68bbb3e-93c4-4162-beeb-e1620273977f 0.92993855
6be8d8b0-b897-47ba-a86d-ff83b3e02ea4 0.026935648

According to your description, the first column is the read ID and the second column is the m6A-modified probability of this read at the candidate coordinate (on ENST00000000233.9). But I find m6A-modified probabilities below 0 and even above 1. Is this output correct? I did not encounter any errors when running the process.
Thanks!
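
To check how many per-read values are missing or fall outside [0, 1], the details file can be scanned directly. The sketch below assumes the layout shown above: a site line with three whitespace-separated fields (transcript, position, motif) followed by read lines holding a read ID and a value. It makes no claim about whether the values are probabilities or raw scores; scan_details is a hypothetical helper.

```python
def scan_details(lines) -> dict:
    """Count per-read values that are missing or outside [0, 1]."""
    counts = {"reads": 0, "missing": 0, "out_of_range": 0}
    for line in lines:
        fields = line.split()
        if not fields or len(fields) == 3:   # blank line or site header
            continue
        counts["reads"] += 1
        if len(fields) < 2:                  # read ID without a value
            counts["missing"] += 1
            continue
        value = float(fields[1])
        if not 0.0 <= value <= 1.0:          # also catches nan
            counts["out_of_range"] += 1
    return counts


example = [
    "ENST00000000233.9 317 AAACA",
    "df2e44fd-337f-4d04-bcd6-95d9ce62e6ef 1.4540097",
    "21a9b4c5-8ab5-42c2-a561-4aa9fffc6688 -2.8870084",
    "dd4e2a5d-dfbe-45fd-bd18-c4b15ae12bae -0.56397027",
    "a68bbb3e-93c4-4162-beeb-e1620273977f 0.92993855",
    "6be8d8b0-b897-47ba-a86d-ff83b3e02ea4 0.026935648",
]
print(scan_details(example))
```

The same scan flags read lines that carry no value at all, which is the symptom described in the "No modification value in the output data" report.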

Abnormal result: The prediction output file is empty

Hi,
I tried to use the DENA pre-trained models to predict m6A in my data, following step 4 (prediction). However, both output files, result.tsv and result_details.tsv, are empty. It seemed to be a data problem, so I checked the extracted features file:

the candidate_predict_pos.txt

cc6m_2244_T7_ecorv	57	62	+	AAACC
cc6m_2244_T7_ecorv	64	69	+	GGACC
cc6m_2244_T7_ecorv	100	105	+	GGACT
cc6m_2244_T7_ecorv	121	126	+	GGACC
cc6m_2244_T7_ecorv	173	178	+	AGACT
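
Under the IUPAC nucleotide codes, R stands for A or G and H for A, C, or T, so the RRACH motif covers 2 x 2 x 1 x 1 x 3 = 12 five-mers; the motifs in the lines above are all among them. The sketch below expands the motif and checks each site line. The five-column layout (name, start, end, strand, motif) is read off the example above, and both helpers are hypothetical.

```python
from itertools import product

# Subset of the IUPAC nucleotide codes needed for RRACH.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "H": "ACT"}


def expand_motif(motif: str) -> set:
    """Expand a degenerate motif into the set of concrete k-mers it covers."""
    return {"".join(kmer) for kmer in product(*(IUPAC[base] for base in motif))}


def unexpected_sites(lines, motif: str = "RRACH") -> list:
    """Return site lines whose fifth column is not covered by the motif."""
    allowed = expand_motif(motif)
    bad = []
    for line in lines:
        fields = line.split()
        if len(fields) == 5 and fields[4] not in allowed:
            bad.append(line)
    return bad


if __name__ == "__main__":
    print(sorted(expand_motif("RRACH")))
```

The twelve expanded five-mers coincide with the twelve model directory names printed in the LSTM_predict.py log in an earlier report on this page.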

the features tmp file

>cc6m_2595_T7_ecorv_724_GAACC
adec290a-fa23-4951-9863-51c26c65ab5e	15.0,5.0,2.0,3.0,4.0,116.0,12.0,29.0,19.0,6.0,2.05947621375705,0.05645722345324113,-0.6669842273497038,-1.1006984040938008,-0.04247432038586929,1.896162953567763,0.0775064880998608,-0.7244704949363322,-1.1033572585754785,-0.061418658567826176,nan,nan,nan,nan,nan
fb801573-cb5c-4303-bc93-2a03078640d3	37.0,36.0,43.0,39.0,27.0,54.0,47.0,15.0,19.0,139.0,2.1592956780773624,0.21144323911696417,-0.5459008501149659,-1.3049430979691616,-0.14210523308214737,2.103417210260246,0.19766736049756647,-0.5841787317127636,-1.3049430979691616,-0.1443903048444529,nan,nan,nan,nan,nan
0c957530-bdeb-4fe8-831b-7af44316226b	24.0,26.0,26.0,26.0,19.0,53.0,29.0,23.0,16.0,6.0,1.6108504242449508,0.3408191542598859,-0.5112268966474004,-1.0857024459872902,0.07775649380342814,1.389964503297866,0.19353955346470164,-0.47542923568932577,-1.073641710605908,0.019864963972790688,nan,nan,nan,nan,nan

I also printed the intermediate variable predict_label, and it is all nan:

predict_label
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [nan, nan],
        [nan, nan]])
......

How could I resolve it? Thanks in advance.
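
The feature lines above end in five nan values, and arithmetic on NaN yields NaN, so it is plausible (though not confirmed here) that these entries are what turns every prediction into nan. The sketch below parses one feature line in the layout shown (read ID, tab, comma-separated floats) and reports which positions are NaN; nan_positions is a hypothetical helper.

```python
import math


def nan_positions(feature_line: str):
    """Return the read ID and the indices of NaN entries in its feature vector."""
    read_id, _, values = feature_line.strip().partition("\t")
    numbers = [float(v) for v in values.split(",")]
    return read_id, [i for i, x in enumerate(numbers) if math.isnan(x)]


if __name__ == "__main__":
    # Shortened, made-up line in the same layout as the tmp file above.
    line = "read-1\t15.0,5.0,2.05,-0.66,nan,nan"
    print(nan_positions(line))
    print(math.isnan(1.0 + float("nan")))  # NaN propagates through arithmetic
```

Running this over a whole tmp file shows whether the NaN entries always occupy the same trailing positions, which would point at one feature group rather than at individual reads.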

Araport11 transcriptome for control

‌Dear DENA team,
I am opening an issue because the email address [[email protected]] does not work; my message bounced.

I would like to know which Araport11 version you used in the DENA paper.
Could you send me the link where you downloaded it?

I would like to repeat your m6A calling analyses on your Col-0 data and compare the results with the output in your Supplementary data to check that my installation and run of DENA are correct.

Best wishes,

Jeremy

Potential issue with re-squiggle step

Dear DENA author,

I have a potential issue when I run the Tombo re-squiggle step:
I lose around half of my reads, which fail the re-squiggle step.
Is that expected? Did you face this issue? Do you have any idea what could be the cause?
I am using the same Arabidopsis transcriptome you used.

I already looked for a potential solution in the Tombo issues but found no solution or explanation.

I got something like the following:

100%|▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒| 2977277/2977277 [14:30:45<00:00, 56.99it/s]
[07:12:26] Final unsuccessful reads summary (49.4% reads unsuccessfully processed; 1470935 total reads):
47.8% (1422467 reads) : Alignment not produced
1.6% ( 46193 reads) : Poor raw to expected signal matching (revert with tombo filter clear_filters)
0.1% ( 2264 reads) : Read event to sequence alignment extends beyond bandwidth
0.0% ( 7 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
0.0% ( 2 reads) : Not enough raw signal around potential genomic deletion(s)
0.0% ( 2 reads) : Fastq slot not present in --basecall-group

Otherwise, all the steps of your tool DENA work perfectly!

I hope you can help me with that.
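
The percentages in the summary can be cross-checked from the raw counts. The sketch below parses the category lines as pasted above (the regular expression is fitted to that text, not to any documented Tombo format) and recomputes each share against the 2977277 reads shown in the progress bar.

```python
import re

# Matches lines such as "47.8% (1422467 reads) : Alignment not produced".
LINE = re.compile(r"^\s*([\d.]+)%\s*\(\s*(\d+)\s+reads\)\s*:\s*(.+)$")


def parse_summary(lines):
    """Return (reason, read_count) pairs for every category line."""
    rows = []
    for line in lines:
        match = LINE.match(line)
        if match:
            rows.append((match.group(3).strip(), int(match.group(2))))
    return rows


summary = [
    "47.8% (1422467 reads) : Alignment not produced",
    "1.6% ( 46193 reads) : Poor raw to expected signal matching (revert with tombo filter clear_filters)",
    "0.1% ( 2264 reads) : Read event to sequence alignment extends beyond bandwidth",
    "0.0% ( 7 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)",
    "0.0% ( 2 reads) : Not enough raw signal around potential genomic deletion(s)",
    "0.0% ( 2 reads) : Fastq slot not present in --basecall-group",
]

total_reads = 2977277
rows = parse_summary(summary)
failed = sum(count for _, count in rows)

for reason, count in rows:
    print(f"{100 * count / total_reads:5.1f}%  {reason}")
print(f"{100 * failed / total_reads:.1f}% failed in total ({failed} reads)")
```

The category counts add up to the 1470935 failed reads reported in the log, and nearly all of them fall under "Alignment not produced", so the mapping step is the natural place to look first.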