dcdanko / minerva_barcode_deconvolution Goto Github PK

Sort Linked Read DNA Into Fragment Specific Clusters

License: MIT License

C 13.42% Python 82.86% R 3.72%

minerva_barcode_deconvolution's Issues

Only a fraction of reads are deconvolved

Hello,

I am trying to run Minerva on a toy dataset, but Minerva's output file contains only about 1% of the reads present in my original file. I have lowered the thresholds to -a 1 and -d 1 to exclude as few reads as possible. The command line is:

cat ./reads_cov50_redundance4.fastq | minerva_deconvolve -k 20 -w 40 -d 1 -a 1 --remove-stopwords --eps 0.51 > results_minerva/deconvolved_minerva_E_coli.tsv

The problem may come from the way this toy data is generated, which is not identical to the output of longranger basic (for example, the tags are numbers rather than sequences). Here are a few lines of the input fastq:

@read0_TBX:0 BX:Z:4104
AAAGCGAGTCGAACCACTTCCGAAGGAGCCGTTCGCTAATTGTGCACGAGTCTAAGTATGTATCTAGGACCTCTCCCTAAACCTCGATCTCGTGCCTTCGTCTGTCGTCCGATAGGCCTATGGCTACTCAGTTCTATTCTAGACGTCCTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read0_TBX:0 BX:Z:4104
ACTCAGTTAGACAAGAGGTACTTCAGAACCTAAGTGACAACCTTGTCTCTCGAGTGGGAGTACCCCGCCAAGTAAGCCTAGGATGATATGCCTACCAAAGCTACCAACGGGCACGTCATCCTTCTCGGCGCGAGGCCCAACGGGATTATG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read1_TBX:0 BX:Z:4104
CGTGGATATGATGAGATCAACCTGAATGTCGGCTGCCCGTCTGACCGGGTGCAGAACGGCATGTTTGGTGCGTGTCTGATGGGTAATGCGCAGCTGGTTGCCGACTGCGTGAAAGCGATGCGCGATGTGGTGTCGATTCCGGTGACGGTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read1_TBX:0 BX:Z:4104
TTTCCGGCAAAGGCGAGTGTGAGATGTTCATCATCCACGCACGTAAAGCCTGGCTTTCGGGGTTAAGCCCGAAAGAAAACCGTGAAATCCCGCCGCTCGATTATCCGCGTGTGTATCAACTGAAGCGTGACTTTCCGCATCTGACGATGT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Do you have any idea what I should change to get a better result?

Thanks in advance,
Roland
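
One way to narrow down a report like this is to check how the reads are distributed across barcodes before running Minerva. The sketch below is not part of Minerva: it assumes plain four-line FASTQ records with a `BX:Z:` tag in the header, as in the excerpt above, and the helper names `reads_per_barcode` and `barcodes_at_or_above` are made up for illustration. The example records are shortened placeholders, not real data.

```python
import io
import re
from collections import Counter

# Matches the barcode value in headers such as "@read0_TBX:0 BX:Z:4104".
BX_TAG = re.compile(r"BX:Z:(\S+)")


def reads_per_barcode(handle):
    """Count reads per BX barcode, assuming four-line FASTQ records."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:
            continue  # only header lines carry the barcode tag
        match = BX_TAG.search(line)
        # Reads without a BX tag are counted under the key None.
        counts[match.group(1) if match else None] += 1
    return counts


def barcodes_at_or_above(counts, minimum_reads):
    """Return the barcodes that have at least `minimum_reads` reads."""
    return sorted(
        barcode
        for barcode, n in counts.items()
        if barcode is not None and n >= minimum_reads
    )


# Shortened placeholder records (hypothetical, for illustration only).
example = io.StringIO(
    "@read0_TBX:0 BX:Z:4104\nACGT\n+\nAAAA\n"
    "@read0_TBX:0 BX:Z:4104\nTTGA\n+\nAAAA\n"
    "@read1_TBX:1 BX:Z:77\nGGCC\n+\nAAAA\n"
)
counts = reads_per_barcode(example)
print(counts.most_common())
print(barcodes_at_or_above(counts, 2))
```

If most barcodes carry very few reads, or many reads land under `None`, the input format is a more likely culprit than the thresholds.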

Problem with --help

After installing with the line below:
pip install minerva_deconvolve --user
I ran this:
minerva_deconvolve --help

and got this error:

Traceback (most recent call last):
  File "/home/majid001/.local/bin/minerva_deconvolve", line 6, in <module>
    from minerva.deconvolution.deconvolve_barcodes import main
  File "/home/majid001/.local/lib/python2.7/site-packages/minerva/deconvolution/deconvolve_barcodes.py", line 27
    print(msg.format(len(barcodeTables)), file=sys.stderr)
                                              ^
SyntaxError: invalid syntax

Could you please help me with this issue?
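
The paths in the traceback (`python2.7/site-packages`) suggest the script was installed under Python 2.7. There, `print(..., file=sys.stderr)` is a SyntaxError unless the module imports `print_function` from `__future__`. Whether the package is meant to support Python 2 at all is not stated here. The check below is a minimal sketch for confirming which interpreter is in use; `describe_interpreter` is a hypothetical helper, not part of Minerva.

```python
import sys


def describe_interpreter(version_info=sys.version_info):
    """Say whether print(..., file=...) parses on the given Python version."""
    major, minor = version_info[0], version_info[1]
    if major < 3:
        return "Python {}.{}: print(..., file=...) is a SyntaxError".format(
            major, minor
        )
    return "Python {}.{}: print(..., file=...) parses".format(major, minor)


# Report on the interpreter that runs this script.
print(describe_interpreter())
```

Running it with the same interpreter that `pip` used shows whether the install landed in a Python 2 environment.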

Out of memory error

Hello David,

I just discovered Minerva recently and I want to use it on my dataset. I was able to install it very easily and use it on a subset of my data. I am now trying to run Minerva on my whole 10X dataset (~300M PE reads, ~2.8M barcodes) on a server with 500 GB memory. As mentioned in the readme, I used the following command:

> cat barcoded.fastq | minerva_deconvolve -k 20 -w 40 -d 8 -a 20 --remove-stopwords --eps 0.51 > ebc_assignments.tsv

After running for a few days, the output file was still empty, and the last message printed before the process ran out of memory was:
"parsed 1,327,100"

My question is: have you ever used Minerva on a dataset this size? Is there a workaround to limit memory usage on my data? Also, in the paper, you mention that the method is easy to multithread. Is this something I can do on my end (for instance by splitting the fastq file), or something that might be included in Minerva in the future?

Thank you,
Cédric

Error to install

I got the following error during installation:

running build_ext
building 'cseqs' extension
gcc -pthread -B /research/cxs/anaconda2/compiler_compat -Wl,--sysroot=/ -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/research/cxs/anaconda2/include/python2.7 -c cext_minerva/_cseqs.c -o build/temp.linux-x86_64-2.7/cext_minerva/_cseqs.o
cext_minerva/_cseqs.c:36:15: error: variable 'cseqsmodule' has initializer but incomplete type
 static struct PyModuleDef cseqsmodule = {
               ^
cext_minerva/_cseqs.c:37:3: error: 'PyModuleDef_HEAD_INIT' undeclared here (not in a function)
   PyModuleDef_HEAD_INIT,
   ^
cext_minerva/_cseqs.c:37:3: warning: excess elements in struct initializer
cext_minerva/_cseqs.c:37:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:38:3: warning: excess elements in struct initializer
   "cseqs",
   ^
cext_minerva/_cseqs.c:38:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:39:3: warning: excess elements in struct initializer
   module_docstring,
   ^
cext_minerva/_cseqs.c:39:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:40:3: warning: excess elements in struct initializer
   -1,
   ^
cext_minerva/_cseqs.c:40:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:41:3: warning: excess elements in struct initializer
   cseqs_methods
   ^
cext_minerva/_cseqs.c:41:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c: In function 'PyInit_cseqs':
cext_minerva/_cseqs.c:45:10: warning: implicit declaration of function 'PyModule_Create' [-Wimplicit-function-declaration]
   return PyModule_Create(&cseqsmodule);
          ^
cext_minerva/_cseqs.c:45:10: warning: 'return' with a value, in function returning void
error: command 'gcc' failed with exit status 1

after minerva_deconvolve

Hi @dcdanko !

I ran the command you suggest in the README file for minerva_deconvolve and got my "ebc_assignments.tsv" file with 8 different clusters. My question is what analysis I can perform with this information. One idea is to assemble the reads from each cluster separately. Do you have any additional suggestions?

Thanks.
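
Assembling each cluster separately first requires grouping the read IDs by cluster. The sketch below assumes the column layout described in another issue on this page (barcode, read ID, then an integer) and further assumes that the integer is a cluster label within the barcode, which is not confirmed here. `group_reads_by_cluster` is a hypothetical helper and the rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_reads_by_cluster(handle):
    """Map (barcode, cluster label) to the read IDs assigned to it.

    Assumes tab-separated rows of: barcode, read ID, integer label.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        barcode, read_id, label = row[0], row[1], row[2]
        clusters[(barcode, label)].append(read_id)
    return dict(clusters)


# Invented rows standing in for ebc_assignments.tsv.
example = io.StringIO(
    "4104\tread0\t0\n"
    "4104\tread1\t1\n"
    "4104\tread2\t0\n"
    "77\tread3\t0\n"
)
groups = group_reads_by_cluster(example)
for key in sorted(groups):
    print(key, groups[key])
```

Each list of read IDs could then be used to pull the matching records out of the original FASTQ before handing them to an assembler.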

SyntaxError to install

I followed the installation procedure in the README, and I was able to install the requirements using pip (pip version: 18.1; python version: 3.6). But when I install Minerva from the code, it raises a SyntaxError. The details of the SyntaxError are:

After installation, "minerva_annotate" and "minerva_eval" run successfully, but "minerva_deconvolve" and "minerva_enhance_kraken" are not installed successfully.

Is my installation procedure wrong? I would really appreciate it if you could help me fix it.

division by zero error

I keep getting this division by zero error, with different datasets:

parsed 966,783 barcodes
Removing 2,518,439 stop and singleton kmers
Removed stop and singleton kmers
686,968 barcodes were at or above dropout threshold

Traceback (most recent call last):
  File "/research/c/anaconda3/bin/minerva_deconvolve", line 11, in <module>
    load_entry_point('minerva-barcoded-read-deconvolution', 'console_scripts', 'minerva_deconvolve')()
  File "/research/c/src/minerva_barcode_deconvolution/minerva/deconvolution/deconvolve_barcodes.py", line 35, in main
    progressBar.write()
  File "/research/c/src/minerva_barcode_deconvolution/minerva/deconvolution/progress_bar.py", line 23, in write
    p = self.events / self.total
ZeroDivisionError: float division by zero
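
The traceback shows the division `self.events / self.total` failing because `total` is zero. Why the total is zero is not clear from the log. As a rough sketch of a guard against that one failure, the class below treats an empty workload as already complete. `SafeProgress` is a hypothetical stand-in, not the `progress_bar.py` class from the repository.

```python
class SafeProgress(object):
    """Track progress without dividing by a zero total."""

    def __init__(self, total):
        self.total = total
        self.events = 0

    def step(self):
        """Record one finished unit of work."""
        self.events += 1

    def fraction(self):
        """Return the completed fraction, or 1.0 when there is nothing to do."""
        if not self.total:
            return 1.0
        return self.events / float(self.total)


progress = SafeProgress(4)
progress.step()
print(progress.fraction())
print(SafeProgress(0).fraction())
```

Returning 1.0 for an empty total is a design choice; raising a clear error that names the empty input would be an equally reasonable alternative.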

running on provided sample data

Dear David,

When running the latest github version of Minerva on the sample data from the README (Dataset 1) using parameters recommended from the README (-k 20 -w 40 -d 8 -a 50 --remove-stopwords), the program produces no results to stdout, and only outputs debug info to stderr. This behavior doesn't happen on other test datasets, where proper results are printed to stdout. I also tried with -k 20 -a 50 as per the bioRxiv paper and no luck. What were the parameters used on that dataset?

To reproduce the issue:

wget https://s3.us-east-2.amazonaws.com/minerva-datasets/10M.data1_atgctgaaq.fq.gz
zcat 10M.data1_atgctgaaq.fq.gz |  minerva_deconvolve -k 20 -w 40 -d 8 -a 50 --remove-stopwords > results.tsv
wc -l results.tsv # returns 0

Thanks in advance,
Rayan

How to interpret the result: the third column

I understand that the first column is the barcode and the second column is the corresponding read id.
The third column is an integer, but its meaning is not obvious.
