dcdanko / minerva_barcode_deconvolution
Sort Linked Read DNA Into Fragment Specific Clusters
License: MIT License
Hello,
I am trying to run Minerva on a toy dataset, but the output file of Minerva only contains about 1% of all the reads present in my original file. I have lowered the thresholds to -a 1 and -d 1 to exclude as few reads as possible. The command line is
cat ./reads_cov50_redundance4.fastq | minerva_deconvolve -k 20 -w 40 -d 1 -a 1 --remove-stopwords --eps 0.51 > results_minerva/deconvolved_minerva_E_coli.tsv
The problem may come from how this toy dataset was generated: it is not identical to the output of longranger basic (for example, the tags are numbers rather than sequences). Here are a few lines of the input FASTQ:
@read0_TBX:0 BX:Z:4104
AAAGCGAGTCGAACCACTTCCGAAGGAGCCGTTCGCTAATTGTGCACGAGTCTAAGTATGTATCTAGGACCTCTCCCTAAACCTCGATCTCGTGCCTTCGTCTGTCGTCCGATAGGCCTATGGCTACTCAGTTCTATTCTAGACGTCCTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read0_TBX:0 BX:Z:4104
ACTCAGTTAGACAAGAGGTACTTCAGAACCTAAGTGACAACCTTGTCTCTCGAGTGGGAGTACCCCGCCAAGTAAGCCTAGGATGATATGCCTACCAAAGCTACCAACGGGCACGTCATCCTTCTCGGCGCGAGGCCCAACGGGATTATG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read1_TBX:0 BX:Z:4104
CGTGGATATGATGAGATCAACCTGAATGTCGGCTGCCCGTCTGACCGGGTGCAGAACGGCATGTTTGGTGCGTGTCTGATGGGTAATGCGCAGCTGGTTGCCGACTGCGTGAAAGCGATGCGCGATGTGGTGTCGATTCCGGTGACGGTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@read1_TBX:0 BX:Z:4104
TTTCCGGCAAAGGCGAGTGTGAGATGTTCATCATCCACGCACGTAAAGCCTGGCTTTCGGGGTTAAGCCCGAAAGAAAACCGTGAAATCCCGCCGCTCGATTATCCGCGTGTGTATCAACTGAAGCGTGACTTTCCGCATCTGACGATGT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Do you have any idea what I should change to get a better result?
Thanks in advance,
Roland
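One way to put a number on the loss is to compare the read IDs in the input FASTQ with those in the output TSV. The sketch below assumes four-line FASTQ records and a tab-separated output whose second column is the read ID (as another report in this thread describes). It also assumes the output IDs match the first token of the FASTQ header, which is not confirmed here. The function names are illustrative and not part of Minerva.

```python
def fastq_read_ids(handle):
    """Collect read IDs (first header token, without '@') from 4-line FASTQ records."""
    ids = set()
    for i, line in enumerate(handle):
        # Every fourth line, starting at line 0, is a record header.
        if i % 4 == 0 and line.strip():
            ids.add(line[1:].split()[0])
    return ids


def tsv_read_ids(handle, column=1):
    """Collect read IDs from a tab-separated file; column index 1 is an assumption."""
    ids = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            ids.add(fields[column])
    return ids


def retained_fraction(fastq_handle, tsv_handle):
    """Fraction of distinct input read IDs that also appear in the output."""
    inputs = fastq_read_ids(fastq_handle)
    outputs = tsv_read_ids(tsv_handle)
    if not inputs:
        return 0.0
    return len(inputs & outputs) / len(inputs)
```

Note that in the toy data above both mates of a pair share one ID, so this counts distinct IDs rather than individual reads.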
Hi,
After installing with the command below
pip install minerva_deconvolve --user
I ran
minerva_deconvolve --help
and got this error:
Traceback (most recent call last):
File "/home/majid001/.local/bin/minerva_deconvolve", line 6, in <module>
from minerva.deconvolution.deconvolve_barcodes import main
File "/home/majid001/.local/lib/python2.7/site-packages/minerva/deconvolution/deconvolve_barcodes.py", line 27
print(msg.format(len(barcodeTables)), file=sys.stderr)
^
SyntaxError: invalid syntax
Could you please help me with this issue?
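The path in the traceback points at a Python 2.7 site-packages directory, and print(..., file=...) is Python 3 syntax unless the module imports print_function from __future__. A likely cause, then, is that pip installed the package for Python 2. The check below is a minimal sketch that says nothing about Minerva itself; the function name is illustrative.

```python
import sys


def supports_print_function_syntax(version_info=sys.version_info):
    """True when the interpreter parses print(..., file=...) without a __future__ import."""
    return tuple(version_info[:2]) >= (3, 0)


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if supports_print_function_syntax():
        sys.stdout.write("Python {}.{}: print(..., file=...) is valid syntax\n".format(major, minor))
    else:
        sys.stdout.write("Python {}.{}: this syntax needs Python 3\n".format(major, minor))
```

If the check reports Python 2, installing with an explicit interpreter (for example python3 -m pip install minerva_deconvolve --user) may avoid the error, though this thread does not confirm that the package installs cleanly that way.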
Hello David,
I just discovered Minerva recently and I want to use it on my dataset. I was able to install it very easily and use it on a subset of my data. I am now trying to run Minerva on my whole 10X dataset (~300M PE reads, ~2.8M barcodes) on a server with 500 GB memory. As mentioned in the readme, I used the following command:
> cat barcoded.fastq | minerva_deconvolve -k 20 -w 40 -d 8 -a 20 --remove-stopwords --eps 0.51 > ebc_assignments.tsv
After running for a few days, the output file was still empty. The last message on stdout before the process ran out of memory was "parsed 1,327,100".
My questions are: did you ever use Minerva on a dataset of this size? Is there a workaround to limit memory usage on my data? Also, in the paper you mention that the method is easy to multithread. Is that something I can do on my end (for instance by splitting the FASTQ file), or something that might be included in Minerva in the future?
Thank you,
Cédric
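If splitting the FASTQ file is tried, one option is to shard by barcode so that all reads carrying the same BX tag land in the same file. The sketch below assumes four-line records and a BX:Z: tag in the header, as in the FASTQ excerpts quoted in this thread. Whether Minerva gives comparable results on such shards is not confirmed anywhere in this thread, and results may differ from a run on the whole file. All function names are hypothetical.

```python
import zlib


def barcode_of(header):
    """Return the value of the BX:Z: tag in a FASTQ header, or None if absent."""
    for token in header.split():
        if token.startswith("BX:Z:"):
            return token[len("BX:Z:"):]
    return None


def shard_index(barcode, n_shards):
    """Stable shard for a barcode; crc32 is deterministic across runs, unlike hash()."""
    return zlib.crc32(barcode.encode("utf-8")) % n_shards


def split_by_barcode(fastq_handle, shard_handles):
    """Write each 4-line record to the shard chosen by its barcode.

    Records without a BX tag are skipped. Returns the number of records
    written to each shard.
    """
    counts = [0] * len(shard_handles)
    record = []
    for line in fastq_handle:
        record.append(line)
        if len(record) == 4:
            barcode = barcode_of(record[0])
            if barcode is not None:
                idx = shard_index(barcode, len(shard_handles))
                shard_handles[idx].writelines(record)
                counts[idx] += 1
            record = []
    return counts
```

Each shard could then be piped to minerva_deconvolve separately, which bounds the number of barcodes held in memory per run at the cost of ignoring any information shared between barcodes in different shards.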
The following is the error I got when installing:
running build_ext
building 'cseqs' extension
gcc -pthread -B /research/cxs/anaconda2/compiler_compat -Wl,--sysroot=/ -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/research/cxs/anaconda2/include/python2.7 -c cext_minerva/_cseqs.c -o build/temp.linux-x86_64-2.7/cext_minerva/_cseqs.o
cext_minerva/_cseqs.c:36:15: error: variable 'cseqsmodule' has initializer but incomplete type
static struct PyModuleDef cseqsmodule = {
^
cext_minerva/_cseqs.c:37:3: error: 'PyModuleDef_HEAD_INIT' undeclared here (not in a function)
PyModuleDef_HEAD_INIT,
^
cext_minerva/_cseqs.c:37:3: warning: excess elements in struct initializer
cext_minerva/_cseqs.c:37:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:38:3: warning: excess elements in struct initializer
"cseqs",
^
cext_minerva/_cseqs.c:38:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:39:3: warning: excess elements in struct initializer
module_docstring,
^
cext_minerva/_cseqs.c:39:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:40:3: warning: excess elements in struct initializer
-1,
^
cext_minerva/_cseqs.c:40:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c:41:3: warning: excess elements in struct initializer
cseqs_methods
^
cext_minerva/_cseqs.c:41:3: note: (near initialization for 'cseqsmodule')
cext_minerva/_cseqs.c: In function 'PyInit_cseqs':
cext_minerva/_cseqs.c:45:10: warning: implicit declaration of function 'PyModule_Create' [-Wimplicit-function-declaration]
return PyModule_Create(&cseqsmodule);
^
cext_minerva/_cseqs.c:45:10: warning: 'return' with a value, in function returning void
error: command 'gcc' failed with exit status 1
Hi @dcdanko !
I ran the command you suggest in the README file for minerva_deconvolve and got my "ebc_assignments.tsv" file with 8 different clusters. My question is what analysis I can perform with this information. One idea is to assemble the reads from each cluster separately. Do you have any additional suggestions?
Thanks.
I followed the installation procedure in the README and was able to install the requirements with pip (pip version: 18.1; Python version: 3.6). But when I install Minerva from source, I get a SyntaxError. The SyntaxError looks like this:
After installation, "minerva_annotate" and "minerva_eval" run successfully, but "minerva_deconvolve" and "minerva_enhance_kraken" are not installed successfully.
Is my installation procedure wrong? I would really appreciate it if you could help me fix it.
I keep getting this division by zero error, with different datasets:
parsed 966,783 barcodes
Removing 2,518,439 stop and singleton kmers
Removed stop and singleton kmers
686,968 barcodes were at or above dropout threshold
Traceback (most recent call last):
File "/research/c/anaconda3/bin/minerva_deconvolve", line 11, in <module>
load_entry_point('minerva-barcoded-read-deconvolution', 'console_scripts', 'minerva_deconvolve')()
File "/research/c/src/minerva_barcode_deconvolution/minerva/deconvolution/deconvolve_barcodes.py", line 35, in main
progressBar.write()
File "/research/c/src/minerva_barcode_deconvolution/minerva/deconvolution/progress_bar.py", line 23, in write
p = self.events / self.total
ZeroDivisionError: float division by zero
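The traceback shows self.events / self.total failing because the total is zero. Why the total is zero is not clear from the log. A guard along the following lines would avoid the crash. This is a hypothetical stand-in written for illustration, not the code of Minerva's progress_bar.py.

```python
class SafeProgress:
    """Hypothetical progress counter; not Minerva's actual progress bar class."""

    def __init__(self, total):
        self.total = float(total)
        self.events = 0.0

    def increment(self, n=1):
        self.events += n

    def fraction(self):
        # With a total of zero there is nothing to process, so report the
        # work as complete instead of dividing by zero.
        if self.total == 0:
            return 1.0
        return self.events / self.total
```

A guard like this only hides the symptom. A total of zero may itself indicate that an earlier filtering step left nothing to process.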
Dear David,
When running the latest GitHub version of Minerva on the sample data from the README (Dataset 1) with the parameters recommended in the README (-k 20 -w 40 -d 8 -a 50 --remove-stopwords), the program produces no results on stdout and only outputs debug info to stderr. This doesn't happen on other test datasets, where proper results are printed to stdout. I also tried -k 20 -a 50, as per the bioRxiv paper, with no luck. What parameters were used on that dataset?
To reproduce the issue
wget https://s3.us-east-2.amazonaws.com/minerva-datasets/10M.data1_atgctgaaq.fq.gz
zcat 10M.data1_atgctgaaq.fq.gz | minerva_deconvolve -k 20 -w 40 -d 8 -a 50 --remove-stopwords > results.tsv
wc -l results.tsv # returns 0
Thanks in advance,
Rayan
I understand the second column is the read id, while the first column is the corresponding barcode.
The third column is an integer, but its meaning is not obvious.