natir / yacrd
Yet Another Chimeric Read Detector
License: MIT License
During an update due to migrating to GCC10 on Bioconda I'm seeing the following error:
2022-02-28T23:15:17.9297028Z 23:15:17 BIOCONDA INFO (OUT) error[E0432]: unresolved import `clap::Clap`
2022-02-28T23:15:17.9297782Z 23:15:17 BIOCONDA INFO (OUT) --> src/main.rs:25:5
2022-02-28T23:15:17.9298056Z 23:15:17 BIOCONDA INFO (OUT) |
2022-02-28T23:15:17.9298417Z 23:15:17 BIOCONDA INFO (OUT) 25 | use clap::Clap;
2022-02-28T23:15:17.9298802Z 23:15:17 BIOCONDA INFO (OUT) | ^^^^^^^^^^ no `Clap` in the root
Perhaps clap has changed its API?
EDIT: False alarm, I forgot a parameter in the command (chimeric).
It actually works very well. It was just the Readme that confused me a little bit 😅 .
Hi Pierre,
I'm very interested in your package, but the installation fails, or nearly so. I tried installing from conda, from source, and with cargo on my Mac (10.14.4), and from source and with cargo via Docker (official Rust image). The next two commands output the same thing every time:
yacrd 0.5.1 Omanyte
Pierre Marijon <[email protected]>
Yet Another Chimeric Read Detector
USAGE:
yacrd [SUBCOMMAND]
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
SUBCOMMANDS:
chimeric In chimeric mode yacrd detect chimera if coverage gap are in middle of read
help Prints this message or the help of the given subcommand(s)
scrubbing In scrubbing mode yacrd remove all part of read not covered
yacrd -i /data/mapping.paf -o /data/reads.yacrd
error: Found argument '-i' which wasn't expected, or isn't valid in this context
USAGE:
yacrd [SUBCOMMAND]
For more information try --help
Do you have any idea of the origin of the problem?
Regards,
I'm getting this error:
$ yacrd scrubb -i run1.all.report.yacrd -i run1.fastq -o run2.scrubbed.fastq
error: The argument '--input <input>' was provided more than once, but cannot be used multiple times
USAGE:
yacrd --input <input> --output <output> scrubb --input <input> --output <output>
For more information try --help
This seems odd, since all the readme examples show multiple inputs and outputs. I don't recall seeing this behavior on CentOS. This is installed on macOS via Bioconda.
$ yacrd --version
yacrd 0.6.0
Write records to a file if they contain a chimeric read.
Hi there,
I've just been testing out yacrd, with the following command:
$MINIMAP2 -x ava-ont -g 500 -t 72 $1.chopped.fastq.gz $1.chopped.fastq.gz | \
yacrd chimeric -f $1.chopped.fastq.gz > $1_dechimerized.fastq.gz
ERR3219853.fastq.gz.chopped.fastq.gz
I ran it on a file named ERR3219853.fastq.gz (after also running PoreChop over that).
What I expected, based on the readme, was for the fastq.gz to be written to ERR3219853.fastq.gz_dechimerized.fastq.gz. Instead, what I got was a text-based report written to ERR3219853.fastq.gz_dechimerized.fastq.gz, and the dechimerized fastq written to ERR3219853_filtered.fastq.gz.chopped.fastq.gz.
Re-reading the readme, and digging into the source code, I now understand what it's getting at, but I think it would be better to more explicitly state how yacrd munges the output filename by inserting _filtered before the ".". Even better would be to provide an option to specify the output filename (and report filename) explicitly.
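The renaming reported above can be sketched in a few lines. This is only an illustration of the observed behaviour, assuming the suffix is inserted before the first "." of the input name; the function name `munged_name` is hypothetical and is not taken from the yacrd source.

```python
def munged_name(filename: str, suffix: str = "_filtered") -> str:
    """Insert `suffix` before the first "." of `filename`.

    Assumption: this mirrors the naming observed in the report above,
    it is not copied from yacrd itself.
    """
    stem, dot, rest = filename.partition(".")
    return stem + suffix + dot + rest
```

With the file name from this report, `munged_name("ERR3219853.fastq.gz.chopped.fastq.gz")` gives `ERR3219853_filtered.fastq.gz.chopped.fastq.gz`, which is the unexpected output file described above.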
Hi,
We are trying to detect chimeras in some nanopore 16s reads. Although we already know that there are some chimeras in our reads, we are not able to detect them using the parameters from the yacrd example:
minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
Could you please help us with the correct parameters to use for this detection?
Thanks in advance.
Regards
Dear Natir,
Thanks for the great tool for removing chimera from nanopore reads! Amazing!
The scrubb and filter commands work nicely for me using .paf as an input. However, I want to filter my reads now by the .yacrd report file. I am getting an error: "src/stack.rs 205". The command I was using is:
yacrd -i test.yacrd -o chimera/test.output.yacrd filter -i test.fastq -o chimera/test.fastq
Maybe I am doing something wrong? I would be very grateful if you could help me out.
Thanks a lot in advance for your kind support.
Best,
Maraike
Hi,
great tool, excited to get this to work!
I want to use yacrd to detect chimeric reads in my PacBio CCS data. The data is amplified and the coverage is a little low. After running yacrd, I noticed that not all sequence IDs from the raw fastq file appear in *report.yacrd. Are there reads that have been filtered out and not labeled? If I just want to detect and remove chimeric reads, can I just delete the reads labeled Chimeric? What pipeline should I use? Could you please give me some suggestions?
This is the command I used:
minimap2 -x ava-pb A.fastq.gz A.fastq.gz >A.overlap.paf
yacrd -i A.overlap.paf -o A.report.yacrd split -i A.fastq.gz -o A.filter.fastq.gz
Thank you very much!
we can use :
Create a PostDetectionOperation trait to factor out the code shared between each post-operation type and supported format.
Hi
I have been given some fastq files, demultiplexed with Guppy, and I was thinking of checking them for chimeric reads. This is amplicon data from FMD virus (amplicon size 400 bp). The genome of this mRNA virus is 8.3 kb long (I know you said here that yacrd is only for direct DNA sequencing, and ours is cDNA). I am looking for best practices for analyzing this data, and I would like to understand whether I should run the chimera detection step first, followed by read scrubbing, or vice versa. I realized from an issue here that scrubbing tends to split some chimeric reads. So I was thinking of first performing chimera detection, then splitting those reads as suggested in the issue above, then scrubbing the resulting fastq file?
The issue is that we don't know whether to expect chimeric reads, because we are still experimenting, and we would like to know if there are such reads. So this is an exploratory question.
Issue #28 shows that yacrd scrubbing only accepts sequence files with the extensions fasta and fastq.
We need to support sequence files with any type of extension.
Wait for the integration of rust-bio/rust-bio#222 in a stable version of rust-bio
Hi!
I am wondering if I can use yacrd on my set of ONT cDNA reads that were generated using the SQK-PCS109 PCR-cDNA Sequencing Kit?
Will it remove rare splice variants?
Thanks for your thoughts on this.
Michael
Two solutions:
Write yacrd out in json format.
If the input is like ../something/blabla.paf,
the result needs to be ../something/blabla_suffix.paf,
but it actually is _suffix../something/blabla.paf.
Another test case: ../something.other/blabla.paf
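The expected behaviour for both test cases can be sketched as below. This is a minimal Python illustration, not yacrd's Rust code; the helper name `add_suffix` is hypothetical. It splits off the directory first, so a "." in a directory name (as in ../something.other) cannot be mistaken for the extension separator.

```python
from pathlib import PurePosixPath


def add_suffix(path: str, suffix: str) -> str:
    """Insert `suffix` between the file stem and its last extension."""
    p = PurePosixPath(path)
    # with_name() replaces only the final component, so the directory
    # part (including "..") is left untouched.
    return str(p.with_name(p.stem + suffix + p.suffix))
```

Note that only the last extension is treated as the extension here, so a name such as blabla.fastq.gz would become blabla.fastq_suffix.gz. Whether that is the wanted result for compressed files is a separate design decision.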
Hi!
I'd like to know if I can use yacrd to scrub long reads obtained from MDA samples, specifically in terms of removing palindromes.
If I run yacrd in a scrubbing mode, can I expect that yacrd would remove palindromes in long reads introduced by MDA?
Actually, I've tried to use Pacasus, a tool developed specifically to correct palindromes in long reads from MDA. (https://github.com/swarris/Pacasus)
But, unfortunately, I haven't succeeded in installing the tool on my system.
Thanks.
Ilnam
Add cli option :
-f --filter option takes the file to filter
-o --output takes the file name where the filtered data is written
Supported format :
Hello, @natir,
I'm currently looking into long read scrubbing and came across your tools and publications, and the comparison with DASCRUBBER, but you only briefly mention MiniScrub.
Do you have or know of any comparisons on effect of MiniScrub vs yacrd?
Greetings.
Wait for integration in rust stable channel of rust-lang/rust#44489
Hi,
first of all, thank you for this tool. It's very helpful!
but I still don't get how it really works. I used yacrd on my nanopore genomic reads applying recommended parameters:
minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4
in the output, I get a number of chimeric reads. One of these looks realistic with quite large bad regions in the alignment such as:
but some looks weird with just a little misalignment:
The zero-coverage regions are less than 40 nt, which is less than 1% of the overall read length. Why does yacrd think that these reads are chimeric?
Could you explain how the -c and -n parameters really work?
Thanks in advance!
Sergei
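One way to reason about the question above is the following sketch. It is not yacrd's code. It assumes, following the help text quoted earlier on this page ("yacrd detect chimera if coverage gap are in middle of read"), that -c decides which regions count as poorly covered and that -n is a threshold on the fraction of the read made of such regions. The function name, the argument names and the exact order of the tests are my assumptions.

```python
def classify(read_len, bad_regions, not_covered=0.4):
    """Label a read from its poorly covered regions (assumed rule).

    bad_regions: list of (begin, end) intervals whose coverage is at or
    below the -c threshold. not_covered plays the role of -n.
    """
    bad = sum(end - begin for begin, end in bad_regions)
    if read_len == 0 or bad / read_len > not_covered:
        return "NotCovered"
    # Any gap strictly inside the read is enough, whatever its size.
    if any(begin > 0 and end < read_len for begin, end in bad_regions):
        return "Chimeric"
    return "NotBad"
```

Under these assumptions, `classify(5000, [(2480, 2520)])` is "Chimeric" even though the gap is only 40 nt, because the rule looks at where the gap sits rather than how long it is. The length of the gaps only matters for the NotCovered test.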
Hi,
I have 77 GB of filtered ONT long reads (30x coverage of my target genome), and I would like to know if there are chimeric reads; if so, I would like to either split/scrub them or discard them. What pipeline would you recommend in terms of using fpa and yacrd? I am also confused about the difference between split and scrub. To me, it looks like scrub also splits chimeric reads and does some extra trimming. Is this correct?
Thank you very much.
Best wishes,
Yutang
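The difference asked about here can be illustrated with a small sketch. The scrubbing part follows the help text quoted earlier on this page ("In scrubbing mode yacrd remove all part of read not covered"). The split part, which cuts only at gaps inside the read and leaves the read ends alone, is my assumption about that subcommand. `keep_covered` and `internal_only` are hypothetical names, not yacrd options.

```python
def keep_covered(seq, bad_regions, internal_only=False):
    """Return the pieces of `seq` left after removing bad regions.

    bad_regions: (begin, end) intervals without coverage.
    internal_only=False: remove every bad region (scrubbing).
    internal_only=True: ignore bad regions touching a read end and cut
    only at internal ones (assumed behaviour of split).
    """
    pieces, start = [], 0
    for begin, end in sorted(bad_regions):
        if internal_only and (begin == 0 or end == len(seq)):
            continue
        if begin > start:
            pieces.append(seq[start:begin])
        start = max(start, end)
    if start < len(seq):
        pieces.append(seq[start:])
    return pieces
```

For a 30 bp read with bad regions (0, 3), (10, 15) and (25, 30), scrubbing returns the pieces 3..10 and 15..25, while the assumed split returns 0..10 and 15..30. In that sense scrubbing does look like a split plus trimming of the uncovered ends.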
Dear all,
I successfully used yacrd after running minimap2, and it worked perfectly. I have the following question: why do we need the overlap step with minimap2? I refer to the instructions reported below (from the yacrd website):
Here again is my question: why do we have to perform the overlap task at step 1? What information does it give back that is used to identify the chimeric reads?
I am probably missing some information.
Thank you very much for your support.
Dr. Mastriani Emilio
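A sketch may help show what the overlap file provides. Each PAF line gives the read names, read lengths and the start/end coordinates of one overlap, so stacking all overlaps of a read yields its coverage profile, and regions with no coverage are the candidates for chimeric junctions. The code below is an illustration under that reading, not yacrd's implementation; `low_coverage_regions` and `max_cov` are hypothetical names, with `max_cov` playing the role of a coverage threshold.

```python
from collections import defaultdict


def low_coverage_regions(paf_lines, max_cov=0):
    """Map each read to its (begin, end) regions with coverage <= max_cov."""
    lengths = {}
    events = defaultdict(list)  # read name -> [(position, +1 or -1)]
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[0] == f[5]:
            continue  # skip malformed lines and self-overlaps
        # PAF columns 0-3 describe the query, columns 5-8 the target.
        for name, length, start, end in ((f[0], f[1], f[2], f[3]),
                                         (f[5], f[6], f[7], f[8])):
            lengths[name] = int(length)
            events[name].append((int(start), 1))
            events[name].append((int(end), -1))

    regions = {}
    for name, evs in events.items():
        evs.sort()
        evs.append((lengths[name], 0))  # sentinel closing the read
        found, cov, prev = [], 0, 0
        for pos, delta in evs:
            if pos > prev and cov <= max_cov:
                if found and found[-1][1] == prev:
                    found[-1] = (found[-1][0], pos)  # extend adjacent region
                else:
                    found.append((prev, pos))
            cov += delta
            prev = pos
        regions[name] = found
    return regions
```

A read of length 1000 that overlaps one read on 0..400 and another on 600..1000 gets the region (400, 600) reported. Reads that never appear in the PAF get no entry at all in this sketch.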
Any ideas? Trying to make a homebrew package for it.
g++-5 -I/tmp/yacrd-20180421-139125-1jr8c12/yacrd-0.2/inc -DNDEBUG -O3 -flto -march=native -mtune=native -std=c++11 -o CMakeFiles/yacrd.dir/src/analysis.cpp.o -c /tmp/yacrd-20180421-139125-1jr8c12/yacrd-0.2/src/analysis.cpp
/tmp/yacrd-20180421-139125-1jr8c12/yacrd-0.2/src/analysis.cpp:
In function 'std::unordered_set<std::__cxx11::basic_string<char> > yacrd::analysis::find_chimera(const string&, uint64_t, float)':
/tmp/yacrd-20180421-139125-1jr8c12/yacrd-0.2/src/analysis.cpp:52:15:
error: converting to 'std::priority_queue<long unsigned int, std::vector<long unsigned int>, std::greater<long unsigned int> >' from initializer list would use explicit constructor 'std::priority_queue<_Tp, _Sequence, _Compare>::priority_queue(const _Compare&, _Sequence&&) [with _Tp = long unsigned int; _Sequence = std::vector<long unsigned int>; _Compare = std::greater<long unsigned int>]'
stack = {};
^
Using gcc 5.5 on Linux:
cmake version 3.11.1
gcc version 5.5.0 (Homebrew gcc 5.5.0_4)
Hi, I installed yacrd and fpa by conda, fpa worked but yacrd failed, it reported error like this:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:355:21
note: Run with `RUST_BACKTRACE=1` for a backtrace.
I tried on different machines but still, the same issue occurred . Could you please give any ideas about this? Thanks!
At this stage all overlaps are loaded into memory, which is a huge problem for large overlap files.
I imagine this is a change in clap, but on Bioconda I'm running into the following errors during compilation:
2021-03-31T20:34:40.9529590Z 20:34:40 BIOCONDA INFO (ERR) error[E0308]: mismatched types
2021-03-31T20:34:40.9533120Z 20:34:40 BIOCONDA INFO (ERR) --> src/cli.rs:45:17
2021-03-31T20:34:40.9535020Z 20:34:40 BIOCONDA INFO (ERR) |
2021-03-31T20:34:40.9536860Z 20:34:40 BIOCONDA INFO (ERR) 45 | short = "i",
2021-03-31T20:34:40.9538000Z 20:34:40 BIOCONDA INFO (ERR) | ^^^ expected `char`, found `&str`
2021-03-31T20:34:40.9538590Z 20:34:40 BIOCONDA INFO (ERR)
Perhaps you've already fixed this in the main branch (I don't know rust, but I assume this should be 'i' rather than "i") and if so it'd be great if you could tag a new release soon.
% yacrd --version
yacrd 0.2
Helps a lot in pipeline audits.
Hello @natir ,
I wanted to test your tool on a set of contigs, to see whether it can detect "chimeric" contigs as well. But after just 2 min, yacrd crashed with this error message:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', libcore/option.rs:345:21 note: Run with `RUST_BACKTRACE=1` for a backtrace.
The command I ran is :
yacrd -i sample.mecat2.racon.noN.self.olp.paf -o yacrd.out -f fasta -e fasta -s fasta
Any idea, what might have gone wrong?
Best,
Julien
Hi,
I was running yacrd with the following commands:
${SINGULARITYdir}minimap2.simg minimap2 -x ava-ont -t $SLURM_CPUS_PER_TASK -g 500 ${TMPdir}filtered_reads.fq.gz ${TMPdir}filtered_reads.fq.gz >${TMPdir}overlap.paf
yacrd -i ${TMPdir}overlap.paf -o ${TMPdir}report.yacrd -c 4 -n 0.4 scrubb -i ${TMPdir}filtered_reads.fq -o ${TMPdir}reads.scrubb.fasta
and got the following error:
thread 'main' panicked at 'slice index starts at 8092 but ends at 7507', src/libcore/slice/mod.rs:2670:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
SIGABRT: abort
PC=0x47cdab m=0 sigcode=0
goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x10421, 0x6, 0x0, 0x0, 0xc000110180, 0xc000110180)
/usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc00023be70 sp=0xc00023be68 pc=0x47cdab
syscall.Kill(0x10421, 0x6, 0x0, 0x0)
/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc00023beb8 sp=0xc00023be70 pc=0x479bcb
github.com/sylabs/singularity/internal/app/starter.Master.func2()
internal/app/starter/master_linux.go:152 +0x61 fp=0xc00023bf00 sp=0xc00023beb8 pc=0x7928f1
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc00023bf28 sp=0xc00023bf00 pc=0x790f4f
main.main()
cmd/starter/main_linux.go:102 +0x5f fp=0xc00023bf60 sp=0xc00023bf28 pc=0x972bbf
runtime.main()
/usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc00023bfe0 sp=0xc00023bf60 pc=0x433b4e
runtime.goexit()
/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00023bfe8 sp=0xc00023bfe0 pc=0x45f7c1
goroutine 19 [syscall]:
os/signal.signal_recv(0xb9da80)
/usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
/usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41
goroutine 5 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0003cc400)
internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x10436, 0xc00000e140)
internal/app/starter/master_linux.go:151 +0x44c
main.startup()
cmd/starter/main_linux.go:75 +0x53e
created by main.main
cmd/starter/main_linux.go:98 +0x35
rax 0x0
rbx 0x0
rcx 0xffffffffffffffff
rdx 0x0
rdi 0x10421
rsi 0x6
rbp 0xc00023bea8
rsp 0xc00023be68
r8 0x0
r9 0x0
r10 0x0
r11 0x202
r12 0xff
r13 0x0
r14 0xb83b64
r15 0x0
rip 0x47cdab
rflags 0x202
cs 0x33
fs 0x0
gs 0x0
Do you have any idea what could have triggered this error?
I am rerunning it now with backtrace enabled.
All the best and thank you, Dominik
Hi,
I want to detect chimeras in 16S nanopore data, similar to this post. I've tried vsearch now, but as vsearch was developed for high-quality short reads, I think a lot of false positive chimeric sequences are found.
@natir in that post you state "If a read has a poor-quality region in the middle, it's considered chimeric.". But - if I'm not mistaken - this does not lead to correct chimera detection of amplicon chimeras? In amplicon sequencing the error profile (i.e. the poor-quality region) is not related to the read being chimeric or not.
So can I state correctly that yacrd is not suitable for chimera detection of amplicon nanopore data?
Thanks for this nice tool!
Since it starts with an all-vs-all comparison, is it OK to use the parameter -X in minimap2 ("skip self and dual mappings (for the all-vs-all mode)") to save time and disk space?
Hey, @natir,
Any experience with using yacrd to clean MDA derived reads from wgaDNA?
Hi there,
I have been running the spaghetti.sh script for my Nanopore sequencing data. All samples work fine except for one which produces this error:
**Error: Error in compression detection of file barcode41_set2_comb-porechop-nanofilt.paf
Caused by:
File is too short, less than five bytes**
The file doesn't seem to have any issues (similar size and data as the other ones).
Any suggestions?
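A quick way to inspect the problematic file is to look at its first bytes. The sketch below assumes, from the wording of the error, that the check reads a five-byte header to sniff the compression format; it uses the standard gzip, bzip2 and xz magic numbers. `describe_header` and `check_file` are hypothetical helpers, not part of yacrd or spaghetti.sh.

```python
def describe_header(head: bytes) -> str:
    """Guess the compression of a file from its first five bytes."""
    if len(head) < 5:
        return "too short"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"BZh"):
        return "bzip2"
    if head.startswith(b"\xfd7zXZ"):
        return "xz"
    return "uncompressed"


def check_file(path: str) -> str:
    """Read the first five bytes of `path` and describe them."""
    with open(path, "rb") as handle:
        return describe_header(handle.read(5))
```

If `check_file` returns "too short" for the .paf at the moment yacrd starts, one possibility worth ruling out is that the file is still empty or being written at that point, even if it looks complete afterwards.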
Two solutions:
Hi,
so the tool ran easily - thanks - but I am a little concerned with the results.
wc -l *.yacrd
1964840 iddm_report.yacrd
grep -c Chimeric iddm_report.yacrd
114108
grep -c NotBad iddm_report.yacrd
454940
grep -c NotCov iddm_report.yacrd
1395792
As I understand it, out of 1.9m reads, only 454k are NotBad and can therefore be used in further analyses ? From work to date with the unfiltered data (WGS Rat, just genomic alignments), I think most reads are pretty decent.
Or should I be happy with the NotCov reads ?
Commands:
srun -c 16 minimap2 -t 16 -x ava-ont -g 500 iddm_30kbp_3325_comb.fastq.gz iddm_30kbp_3325_comb.fastq.gz > iddm_overlaps.paf &
yacrd -i iddm_overlaps.paf -o report.yacrd -c 4 -n 0.4 scrubb -i iddm_30kbp_3325_comb.fastq.gz -o iddm_30kbp_3325_comb.fastq.gz.scrubb.fasta
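The three grep calls above can be replaced by a single pass over the report. This is a minimal sketch that assumes the label is the first whitespace-separated field of each report line, as in the report lines quoted elsewhere on this page; `tally_labels` is a hypothetical helper.

```python
from collections import Counter


def tally_labels(report_lines):
    """Count how many reads carry each label in a .yacrd report."""
    counts = Counter()
    for line in report_lines:
        fields = line.split()
        if fields:  # ignore blank lines
            counts[fields[0]] += 1
    return counts
```

It can be called as `tally_labels(open("iddm_report.yacrd"))`. For the numbers in this post the three labels account for every line of the report: 114108 + 454940 + 1395792 = 1964840.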
Hi. I'm trying to install yacrd on CentOS 7.4.1708 with cmake v2.8.12.2 and GNU make v3.82
$ git clone https://github.com/natir/yacrd.git
Cloning into 'yacrd'...
remote: Counting objects: 129, done.
remote: Compressing objects: 100% (82/82), done.
remote: Total 129 (delta 70), reused 93 (delta 42), pack-reused 0
Receiving objects: 100% (129/129), 31.66 KiB | 0 bytes/s, done.
Resolving deltas: 100% (70/70), done.
$ cd yacrd
$ ls -l
total 12
-rw-r--r--. 1 root root 548 Mar 29 16:15 CMakeLists.txt
drwxr-xr-x. 2 root root 34 Mar 29 16:15 image
drwxr-xr-x. 2 root root 97 Mar 29 16:15 inc
-rw-r--r--. 1 root root 1071 Mar 29 16:15 LICENSE
-rw-r--r--. 1 root root 2480 Mar 29 16:15 Readme.md
drwxr-xr-x. 2 root root 117 Mar 29 16:15 src
drwxr-xr-x. 2 root root 28 Mar 29 16:15 test
$ mkdir build
$ cd build
$ cmake ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/software/yacrd/build
$ make
Scanning dependencies of target yacrd
[ 20%] Building CXX object CMakeFiles/yacrd.dir/src/analysis.cpp.o
c++: error: unrecognized command line option ‘-Wodr’
c++: error: unrecognized command line option ‘-std=c++14’
make[2]: *** [CMakeFiles/yacrd.dir/src/analysis.cpp.o] Error 1
make[1]: *** [CMakeFiles/yacrd.dir/all] Error 2
make: *** [all] Error 2
$ ls -l
total 28
-rw-r--r--. 1 root root 12003 Mar 29 16:16 CMakeCache.txt
drwxr-xr-x. 6 root root 4096 Mar 29 16:26 CMakeFiles
-rw-r--r--. 1 root root 1582 Mar 29 16:16 cmake_install.cmake
-rw-r--r--. 1 root root 7669 Mar 29 16:16 Makefile
If a read is a chimera, remove the not-covered region and split the read at this region.
Could you please clarify (and maybe also mention in the documentation): when running yacrd chimeric, does it also perform read scrubbing?
I'd like to know whether I need to re-align the dechimerised reads -- ideally I'd like to not have to, as the all-to-all alignment is fairly expensive, even with minimap2.
Hello, @natir,
I tried running yacrd to scrub my reads and got the following output:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:355:21
note: Run with `RUST_BACKTRACE=1` for a backtrace.
I ran the script like this:
minimap2 -x ava-ont -g 500 SRR10150407_1_merged.fq.gz SRR10150407_1_merged.fq.gz > overlap.paf
yacrd scrubbing -c 3 -n 0.4 -m overlap.paf -s SRR10150407_1_merged.fq.gz -S reads_scrubbed.fasta -r scrubbed_report.yacrd
Cheers
What's the right way to obtain a split fasta?
I tried
yacrd -i 35k_all_vs_all.paf -s yacrd.fa
and it gives a few lines like:
Chimeric ERR1716491.27306 21246 152,0,152;187,2522,2709;13514,6541,20055;24,21222,21246
Chimeric ERR1716491.58156 26998 2361,0,2361;2992,2680,5672;3227,6300,9527;3615,9919,13534;2457,16289,18746;5,26993,26998
...
and then crashes with
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', libcore/option.rs:345:21
note: Run with `RUST_BACKTRACE=1` for a backtrace.
producing no fasta file. Dataset is https://transfer.sh/Renf8/35k_all_vs_all.paf in case you need it
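The report lines quoted above can be read with a short parser. The meaning of the fields is inferred from the numbers in this post, where in every triple the first value equals the third minus the second (for example 187 = 2709 - 2522), so each triple is taken to be length, begin, end of one region. That reading, and the helper name `parse_report_line`, are assumptions.

```python
def parse_report_line(line):
    """Split one report line into label, read id, read length, regions.

    Regions are returned as (length, begin, end) tuples (assumed meaning).
    """
    fields = line.split()
    label, read_id, read_len = fields[0], fields[1], int(fields[2])
    regions = []
    if len(fields) > 3:
        for chunk in fields[3].split(";"):
            if chunk:
                length, begin, end = (int(x) for x in chunk.split(","))
                regions.append((length, begin, end))
    return label, read_id, read_len, regions
```

For the first line above this gives four regions on a read of 21246 bases, the last of which, (24, 21222, 21246), ends exactly at the read end.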
Hi @jguhlin.
You seem to have some interesting ideas to improve yacrd runtime, (I stole one, by the way), can we discuss it somewhere?
Mail, twitter, this issue ?
Thanks for your interest in yacrd.