bensutherland / edna_metabarcoding

Pipeline to analyze eDNA metabarcoding samples (PE and SE, demultiplexed, multiplexed)

R 73.14% Shell 26.86%

edna_metabarcoding's Issues

In single-end data, with lower denoising, obitab will not see the *_HS.fa file

We do not do the internal removal for the denoising with single-end data due to the large number of unique amplicons. Because of this, we do not have an *_HS.fa output file, and therefore the standard obitab script will not work on SE data.

The solution using 'unidentified' results in the need for an extra subsequent script

Can we rename the 'unidentified' files so that no separate script is needed after '02_ngsfilter_SE_exp_unident.sh'? That is, can we reduce this to a single obiannotate.sh script? Currently there are two: obiannotate_unident.sh and obiannotate_ident.sh.

00b_split_by_type.sh - grepping issue

Hello!
I am so happy to have found this page - I have been trying to separate demultiplexed sequences using ngsfilter for so long, and this code has been so useful!

I have unfortunately hit a problem when trying to grep sequences using the tags attached via ngsfilter. I finally got my code to work (and grep!), but it grepped 0 entries 😢

Here is my code so far -- I've modified it to work within my environment.

activate virtual env

source obi3-env/bin/activate
cd {folder where all the files are}

import a few sequences to test code functionality

obi import raw_sequences/E-AFR090512_S75_L001_R1_001.fastq.gz test_072022/E-AFR090512_S75_L001_R1
obi import raw_sequences/E-AFR090512_S75_L001_R2_001.fastq.gz test_072022/E-AFR090512_S75_L001_R2
obi import raw_sequences/E-ALM180712_S34_L001_R1_001.fastq.gz test_072022/E-ALM180712_S34_L001_R1
obi import raw_sequences/E-ALM180712_S34_L001_R2_001.fastq.gz test_072022/E-ALM180712_S34_L001_R2
obi import raw_sequences/E-ANE040812_S36_L001_R1_001.fastq.gz test_072022/E-ANE040812_S36_L001_R1
obi import raw_sequences/E-ANE040812_S36_L001_R2_001.fastq.gz test_072022/E-ANE040812_S36_L001_R2

for some reason, obi import does NOT want to work within a for-loop, so I just do it manually
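The post does not say why the loop failed, so the following is only a way to inspect what a loop would run before running it. It is a minimal sketch that prints one `obi import` command per file, using the same `obi import <file> <dms>/<view>` form as the manual calls above. The helper name `print_import_cmds` is hypothetical, and the view name is assumed to be the file name without `_001.fastq.gz`.

```shell
#!/usr/bin/env bash
# Dry run: print one `obi import` command per FASTQ file instead of running it.
# print_import_cmds is a hypothetical helper, not part of the pipeline.
print_import_cmds() {
    local raw_dir=$1 dms=$2 fq view
    for fq in "$raw_dir"/*_001.fastq.gz; do
        [ -e "$fq" ] || continue              # skip if the glob matched nothing
        view=$(basename "$fq" _001.fastq.gz)  # e.g. E-AFR090512_S75_L001_R1
        printf 'obi import %s %s/%s\n' "$fq" "$dms" "$view"
    done
}

# Folder names taken from the commands above.
print_import_cmds raw_sequences test_072022
```

If the printed commands match the manual ones, the output can be piped to `bash`, provided the file names contain no spaces.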

import the ngsfilter file w/ info on sequences

obi import --ngsfilter baboon_diet_ngsfilter.txt test_072022/ngsfile

check if import worked

obi ls test_072022

create file with all the sample names to use for for-loops throughout pipeline

ls *_R1_001.fastq.gz | cut -c -23 > ../samples_R1
ls *_R2_001.fastq.gz | cut -c -23 > ../samples_R2

add primer tags using ngsfilter

for sample in $(cat samples_R1)
do
echo "On sample: $sample"
obi ngsfilter -t ngsfile -u test_072022/unidentified_${sample} test_072022/${sample} test_072022/identified_${sample}
done

separate samples using ngsfilter and grep
first test it using one sample before putting it in a for-loop

obi grep -E -A3 'sample=trnl' test_072022/identified_E-AFR090512_S75_L001_R1 | obi grep -vE '^--$' - > trnl_E-AFR090512_S75_L001_R1

error codes: "error: unrecognized arguments: -s", "error: unrecognized arguments: -E", "error: argument -v/--invert-selection: ignored explicit argument 'E'"

obi grep -S 'trnl' test_072022/identified_E-AFR090512_S75_L001_R1 | obi grep '^--$' - > trnl_E-AFR090512_S75_L001_R1

error codes: "error: the following arguments are required: OUTPUT", "ValueError: unknown url type: '^--$'", "FileNotFoundError: [Errno 2] No such file or directory: '^--$'"

obi grep -S 'trnl' test_072022/identified_E-AFR090512_S75_L001_R1 trnl_E-AFR090512_S75_L001_R1

results: "2022-06-30 19:36:26,770 [grep : INFO ] Grepped 0 entries"

So I've struggled to figure out how to grep sequences individually within my file and filter them into a new file. Is the original script formatted for obitools, and not obitools3? The grep is different (i.e., including 'obi').

Any help would be massively appreciated, thank you!!
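For comparison, the `grep -E -A3 ... | grep -vE '^--$'` pattern in the first attempt looks like the Unix grep idiom applied to a text file, not an `obi grep` call. The sketch below shows what that idiom does, assuming a plain-text FASTQ whose header lines carry an annotation such as `sample=trnl;`. That header layout is an assumption, the helper name `split_by_sample` is hypothetical, and this is not a statement about how `obi grep` behaves in obitools3.

```shell
#!/usr/bin/env bash
# split_by_sample is a hypothetical helper; the header layout is an assumption.
split_by_sample() {
    local tag=$1 fastq=$2
    # -A3 also prints the three lines after each matching header
    # (sequence, '+', quality). grep inserts a '--' line between
    # non-adjacent groups; the second grep removes those separators.
    grep -E -A3 "sample=${tag};" "$fastq" | grep -vE '^--$'
}

# Tiny made-up input: two trnl records around one record from another marker.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read1 sample=trnl;
ACGT
+
IIII
@read2 sample=its;
GGCC
+
IIII
@read3 sample=trnl;
TTAA
+
IIII
EOF

split_by_sample trnl "$demo"   # prints the read1 and read3 records only
```

This only works on text files. Data imported into an obitools3 DMS would first need to be exported to text, or filtered with that tool's own attribute options.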

Extra unneeded folder for samples in SE data analysis

The scripts for the single-end data analysis are currently using the folder 04b_annotated_samples, but this should not be necessary and complicates the convergence of the PE and SE pipelines.

Different script required for read merging for multiplexed or demultiplexed?

Can 01a_read_merging.sh and 01a_read_merging_no_prime.sh be merged into a single script? (These are for multiplexed and de-multiplexed data, respectively.)

The cp -l script isn't well defined enough

cp -l 02_raw_data/your_file_R1_001.fastq 03_merged/your_file_ali.fq

The input file is 'sample_S1_L001_R1_001.fastq', and the expected name in the merged folder is 03_merged/sample_S1_L001_ali.fq
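One way to make the mapping explicit is to derive the target name from the input name. This is a minimal sketch, assuming every input ends in `_R1_001.fastq` as in the example above. The helper name `merged_name` is hypothetical.

```shell
#!/usr/bin/env bash
# merged_name is a hypothetical helper: it maps an R1 input file to the
# name expected in 03_merged by dropping the _R1_001.fastq suffix.
merged_name() {
    local base
    base=$(basename "$1")
    base=${base%_R1_001.fastq}
    printf '03_merged/%s_ali.fq\n' "$base"
}

merged_name 02_raw_data/sample_S1_L001_R1_001.fastq
# prints 03_merged/sample_S1_L001_ali.fq
```

The hard link could then be written as cp -l "$f" "$(merged_name "$f")" inside a loop over the R1 files.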

Does the input need to be .fastq.gz or .fastq ?

When using ngsfilter or illuminapairedend, they require input as .fastq, but currently we are using .fastq.gz as the primer removal (cutadapt) input in the script.
Need to make this parallel.
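If both forms are needed, one option is to keep the .fastq.gz files for cutadapt and write an uncompressed copy next to each for the tools that need .fastq. This is a minimal sketch, and the helper name `decompress_fastq` is hypothetical.

```shell
#!/usr/bin/env bash
# decompress_fastq is a hypothetical helper. For every .fastq.gz in a folder
# it writes an uncompressed .fastq alongside and keeps the original.
# Note: an existing .fastq with the same name is overwritten.
decompress_fastq() {
    local dir=$1 gz
    for gz in "$dir"/*.fastq.gz; do
        [ -e "$gz" ] || continue        # skip if the glob matched nothing
        gunzip -c "$gz" > "${gz%.gz}"   # -c writes to stdout, input is kept
    done
}

decompress_fastq 02_raw_data
```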

SE data - the cutadapt output file name is needed for scripts

Currently 01_scripts/03_retain_unique.sh operates on the 'cut to 230 bp' fastq file, but this file needs to be specifically pointed to, as the standard script works on *assi.fq instead of *assi_230.fq.
We need a solution in which, once cutadapt has run, no follow-up script is needed to fix this.
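One possible direction is to let the downstream script choose its own input, so that it uses the trimmed file when it exists and falls back to the standard one otherwise. This is a minimal sketch that assumes the trimmed file is named by inserting `_230` before `.fq`, as the two patterns above suggest. The helper name `pick_assigned_fastq` is hypothetical.

```shell
#!/usr/bin/env bash
# pick_assigned_fastq is a hypothetical helper. Given the standard *assi.fq
# path, it returns the *assi_230.fq path if that file exists, and the
# standard path otherwise.
pick_assigned_fastq() {
    local standard=$1
    local trimmed="${standard%.fq}_230.fq"
    if [ -e "$trimmed" ]; then
        printf '%s\n' "$trimmed"
    else
        printf '%s\n' "$standard"
    fi
}

# Placeholder file name, for illustration only.
pick_assigned_fastq your_sample_assi.fq
```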
