BactSeq

Introduction

BactSeq is a Nextflow pipeline for performing bacterial RNA-Seq analysis.

News and updates

  • 12/06/24: Dockerfile updated to install environment from yml. There were issues with installing a couple of R packages in the previous version.
  • 06/10/23: Docker and Singularity support now working again. The pipeline will automatically pull the image from Docker Hub with default settings.
  • 28/09/23: Please see example contrasts table and functional enrichment file below in README.

Pipeline summary

The pipeline will perform the following steps:

  1. Trim adaptors from reads (Trim Galore!)
  2. Read QC (FastQC)
  3. Align reads to reference genome (BWA-MEM), or pseudo-align reads to coding gene sequences (kallisto)
  4. Size-factor scaling and gene length (RPKM) scaling of counts (using the median-of-ratios method from DESeq2 and TMM from edgeR)
  5. Principal component analysis (PCA) of normalised expression values
  6. Differential gene expression (DESeq2) (optional)
  7. Functional enrichment of differentially expressed genes (topGO) (optional)
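
The scaling in step 4 can be illustrated with a small, self-contained sketch. This is not the pipeline's own code (BactSeq delegates normalisation to DESeq2 and edgeR); it is a textbook version of the median-of-ratios idea and of the RPKM formula, and the function names `size_factors` and `rpkm` are made up for this example. The pipeline's exact RPKM computation may differ in detail, for instance in which library size it uses.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative, not BactSeq's code).

    counts: dict mapping sample name -> list of raw counts, same gene order.
    Raises statistics.StatisticsError if no gene is non-zero in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are excluded (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor of a sample = median ratio of its counts to the geometric means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(toy_counts)
scaled_A = [c / factors["A"] for c in toy_counts["A"]]
```

In the toy data, sample B was sequenced twice as deeply as sample A, so its size factor comes out twice as large and the scaled counts of the two samples line up.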

Installation

You will need to install Nextflow (version 21.10.3+).

Usage:
nextflow run BactSeq --data_dir [dir] --sample_file [file] --ref_genome [file] --ref_ann [file] -profile conda [other_options]

Mandatory arguments:
  --data_dir [dir]                Path to directory containing FastQ files.
  --ref_genome [file]             Path to FASTA file containing reference genome sequence (bwa) or multi-FASTA file containing coding gene sequences (kallisto).
  --sample_file [file]            Path to file containing sample information.
  -profile [str]                  Configuration profile to use.
                                  Available: conda, docker, singularity.

Other options:
  --aligner [str]                 (Pseudo-)aligner to be used. Options: `bwa`, `kallisto`. Default = bwa.
  --ref_ann [file]                Path to GFF file containing reference genome annotation. Required only if bwa aligner used.
  --cont_tabl [file]              Path to tsv file containing contrasts to be performed for differential expression.
  --fragment_len [str]            Estimated average fragment length for kallisto transcript quantification (only required for single-end reads). Default = 150.
  --fragment_sd [str]             Estimated standard deviation of fragment length for kallisto transcript quantification (only required for single-end reads). Default = 20.
  --func_file [file]              Path to CSV file containing functional annotations (gene ID and GO terms; see example below).
  --l2fc_thresh [str]             Absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
  --outdir [dir]                  The output directory where the results will be saved (Default: './results').
  --paired [bool]                 Data are paired-end (default is to assume single-end).
  --p_thresh [str]                Adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
  --skip_trimming [bool]          Do not trim adaptors from FastQ files.
  --strandedness [str]            Is data stranded? Options: `unstranded`, `forward`, `reverse`. Default = reverse.
  -name [str]                     Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
  -resume                         Resume a previous run, reusing cached results from steps that already completed.
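
When runs are launched from a script, the command line above can be assembled programmatically. The sketch below uses only options listed in this README; the helper name `build_bactseq_command` is hypothetical and not part of BactSeq, and the example paths are the test data file names that appear in the issue reports further down.

```python
import shlex

# Profiles listed under "-profile" in the usage text.
PROFILES = {"conda", "docker", "singularity"}


def build_bactseq_command(data_dir, sample_file, ref_genome, profile="conda",
                          ref_ann=None, cont_tabl=None, func_file=None,
                          paired=False, outdir=None):
    """Return the argv list for a BactSeq run (hypothetical helper)."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    argv = [
        "nextflow", "run", "BactSeq",
        "--data_dir", data_dir,
        "--sample_file", sample_file,
        "--ref_genome", ref_genome,
    ]
    # Optional file/directory arguments are added only when given.
    optional = {
        "--ref_ann": ref_ann,
        "--cont_tabl": cont_tabl,
        "--func_file": func_file,
        "--outdir": outdir,
    }
    for flag, value in optional.items():
        if value is not None:
            argv += [flag, value]
    if paired:
        argv.append("--paired")  # flag without a value
    argv += ["-profile", profile]
    return argv


argv = build_bactseq_command(
    data_dir="test_data/",
    sample_file="test_data/sample_sheet_1.tsv",
    ref_genome="test_data/Mabs.fasta",
    ref_ann="test_data/Mabs.gff3",
    profile="singularity",
)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would start the run; building a list rather than a string avoids shell quoting problems with paths that contain spaces.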

Explanation of parameters:

  • ref_genome: genome sequence for mapping reads.
  • ref_ann: annotation of genes/features in the reference genome.
  • sample_file: TSV file containing sample information (see below)
  • data_dir: path to directory containing FASTQ files.
  • paired: data are paired-end (default is to assume single-end)
  • strandedness: is data stranded? Options: unstranded, forward, reverse. Default = reverse.
  • cont_tabl: (optional) table of contrasts to be performed for differential expression.
  • func_file: (optional) functional annotation file - if provided, functional enrichment of DE genes will be performed.
  • p_thresh: adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
  • l2fc_thresh: absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
  • skip_trimming: do not trim adaptors from reads.
  • outdir: the output directory where the results will be saved (Default: ./results).
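
The two thresholds work together: a gene is reported as differentially expressed only if it passes both. A minimal sketch of that rule follows, assuming strict comparisons; whether the pipeline uses `<` or `<=` is not stated here, and the function name and the toy values are invented for illustration.

```python
def is_differentially_expressed(padj, log2_fold_change,
                                p_thresh=0.05, l2fc_thresh=1.0):
    """Apply the p_thresh and l2fc_thresh rule to one gene (illustrative)."""
    if padj is None:
        # DESeq2 can report NA for the adjusted p-value; treat as not significant.
        return False
    return padj < p_thresh and abs(log2_fold_change) > l2fc_thresh


# Toy results: (gene, adjusted p-value, log2 fold change). Values are made up.
toy_results = [
    ("gene_a", 0.001, 2.4),    # passes both thresholds
    ("gene_b", 0.001, -0.3),   # significant but small effect
    ("gene_c", 0.40, 3.1),     # large effect but not significant
    ("gene_d", None, 1.8),     # no adjusted p-value
]
de_genes = [g for g, padj, lfc in toy_results
            if is_differentially_expressed(padj, lfc)]
```

Because the fold-change threshold is applied to the absolute value, both up-regulated and down-regulated genes can pass.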

Required inputs

  • Note: See the test_data folder for example mandatory inputs for a minimal run. Also see the examples below.

  • Genome sequence: FASTA file containing the genome sequence. Can be retrieved from NCBI.

  • Gene annotation file: GFF file containing the genome annotation. Can be retrieved from NCBI.

  • Sample file: TSV file containing sample information. Must contain the following columns:

    • sample: sample ID
    • file1: name of the FASTQ file (the first file of the pair for paired-end data).
    • file2: name of the second FASTQ file (paired-end data only).
    • group: grouping factor for differential expression and exploratory plots.
    • rep_no: replicate number (if more than one sample per group).

    If data are single-end, leave the file2 column blank.

    Example:

    sample	file1	file2	group	rep_no
    AS_1	SRX1607051_T1.fastq.gz		Artificial_Sputum	1
    AS_2	SRX1607052_T1.fastq.gz		Artificial_Sputum	2
    AS_3	SRX1607053_T1.fastq.gz		Artificial_Sputum	3
    MB_1	SRX1607054_T1.fastq.gz		Middlebrook	1
    MB_2	SRX1607055_T1.fastq.gz		Middlebrook	2
    MB_3	SRX1607056_T1.fastq.gz		Middlebrook	3
    ER_1	SRX1607060_T1.fastq.gz		Erythromycin	1
    ER_2	SRX1607061_T1.fastq.gz		Erythromycin	2
    ER_3	SRX1607062_T1.fastq.gz		Erythromycin	3
    KN_1	SRX1607066_T1.fastq.gz		Kanamycin	1
    KN_2	SRX1607067_T1.fastq.gz		Kanamycin	2
    KN_3	SRX1607068_T1.fastq.gz		Kanamycin	3
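
A sample sheet can be checked before launching a run. The sketch below is not part of BactSeq; it assumes the five column names shown in the example header above and a tab-separated file, and the function name `check_sample_sheet` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "file1", "file2", "group", "rep_no"]


def check_sample_sheet(handle, paired=False):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        # Short rows yield None for absent fields, so normalise to "".
        sample = (row["sample"] or "").strip()
        file1 = (row["file1"] or "").strip()
        file2 = (row["file2"] or "").strip()
        group = (row["group"] or "").strip()
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sample ID '{sample}'")
        seen.add(sample)
        if not file1:
            problems.append(f"line {line_no}: file1 is empty")
        if paired and not file2:
            problems.append(f"line {line_no}: file2 is empty but data are paired-end")
        if not paired and file2:
            problems.append(f"line {line_no}: file2 is set but data are single-end")
        if not group:
            problems.append(f"line {line_no}: group is empty")
    return problems


sheet = (
    "sample\tfile1\tfile2\tgroup\trep_no\n"
    "AS_1\tSRX1607051_T1.fastq.gz\t\tArtificial_Sputum\t1\n"
    "MB_1\tSRX1607054_T1.fastq.gz\t\tMiddlebrook\t1\n"
)
for problem in check_sample_sheet(io.StringIO(sheet)):
    print(problem)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object. A common failure this catches is a header separated by spaces instead of tabs, which shows up as missing columns.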

Optional inputs

  • Contrasts table: TSV file containing the contrasts to be performed to identify differentially expressed genes. It contains two columns, naming the groups (as defined in the group column of the sample file) to be contrasted.

    Example:

    Condition1	Condition2
    Artificial_Sputum	Middlebrook
    Artificial_Sputum	Kanamycin
    Artificial_Sputum	Erythromycin
    Middlebrook	Erythromycin
    Middlebrook	Kanamycin
    Erythromycin	Kanamycin
  • Functional annotation file: CSV file containing functional categories for genes. Enrichment testing will be performed on results from differential gene expression contrasts. First column contains the gene ID (must match the gene IDs in locus_tag of the GFF annotation file); second column contains the functional groups (GO terms).

    Example:

    MAB_0013c,"GO:0003674,GO:0003824,GO:0008150,GO:0008152,GO:0016407,GO:0016740,GO:0016746,GO:0016747"
    MAB_0018c,"GO:0003674,GO:0003824,GO:0005575,GO:0008150,GO:0008152,GO:0008168,GO:0016020,GO:0016021,GO:0016740,GO:0016741,GO:0031224,GO:0032259,GO:0044425"
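
Both optional inputs can be read with Python's standard `csv` module, which also makes it easy to confirm that every group named in the contrasts table exists in the sample file. This is a sketch under assumptions, not BactSeq code: it assumes the contrasts table is tab-separated with one header row, as in the example above, and all three function names are hypothetical.

```python
import csv
import io


def read_contrasts(handle):
    """Parse a two-column, tab-separated contrasts table; the header row is skipped."""
    rows = list(csv.reader(handle, delimiter="\t"))
    return [tuple(cell.strip() for cell in row) for row in rows[1:] if row]


def unknown_contrast_groups(contrasts, sample_groups):
    """Return contrast group names that do not occur in the sample file."""
    known = set(sample_groups)
    return sorted({g for pair in contrasts for g in pair if g not in known})


def read_go_annotations(handle):
    """Parse lines of the form gene_id,"GO:...,GO:..." into {gene_id: [terms]}."""
    annotations = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # The quoted second field arrives as one string; split it into terms.
        annotations[row[0]] = [term for term in row[1].split(",") if term]
    return annotations


contrasts = read_contrasts(io.StringIO(
    "Condition1\tCondition2\n"
    "Artificial_Sputum\tMiddlebrook\n"
    "Middlebrook\tKanamycin\n"
))
groups = ["Artificial_Sputum", "Middlebrook", "Erythromycin", "Kanamycin"]
print(unknown_contrast_groups(contrasts, groups))

go = read_go_annotations(io.StringIO(
    'MAB_0013c,"GO:0003674,GO:0003824,GO:0008150"\n'
))
print(go["MAB_0013c"])
```

Group names are compared exactly, so differences in case or stray whitespace inside a name are reported as unknown groups.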
    

Output

  1. trim_galore directory containing adaptor-trimmed RNA-Seq files and FastQC results.
  2. read_counts directory containing:
    1. ref_gene_df.tsv: table of genes in the annotation.
    2. gene_counts.tsv: raw read counts per gene.
    3. deseq_counts.tsv: size factor-scaled, log2 counts matrix, normalised using DESeq2.
    4. cpm_counts.tsv: size factor-scaled, log2 counts per million (CPM) matrix, normalised using edgeR.
    5. rpkm_counts.tsv: size factor-scaled and gene length-scaled counts, expressed as reads per kilobase per million mapped reads (RPKM).
  3. PCA_samples directory containing principal component analysis results.
  4. diff_expr directory containing differential expression results.
  5. func_enrich directory containing functional enrichment results (optional).

Contributors

  • adamd3

bactseq's Issues

Failing to run DESeq2 on the provided example

Hi!
Thank you for developing bactseq!
I'm getting an error while trying to run DESeq2 on the test_data.
The command:
nextflow run /tools/BactSeq/BactSeq --data_dir /tools/BactSeq/BactSeq/test_data/ --ref_genome /tools/BactSeq/BactSeq/test_data/Mabs.fasta --ref_ann /tools/BactSeq/BactSeq/test_data/Mabs.gff3 --sample_file /tools/BactSeq/BactSeq/test_data/sample_sheet_1.tsv -profile singularity --outdir ./test_singularity4 --cont_tabl /tools/BactSeq/BactSeq/test_data/contratst.txt

The contrasts file is attached.
contratst.txt

error log:
executor > local (13)
[64/bd83ac] process > MAKE_META_FILE (sample_sheet_1.tsv) [100%] 1 of 1 ✔
[6d/de3a9d] process > TRIMGALORE (3C3) [100%] 4 of 4 ✔
[1f/6294e1] process > MAKE_BWA_INDEX (Mabs.fasta) [100%] 1 of 1 ✔
[ca/258ef2] process > BWA_ALIGN (3D3) [100%] 4 of 4 ✔
[12/77d6fc] process > COUNT_READS (Mabs.gff3) [100%] 1 of 1 ✔
[03/53307a] process > NORMALISE_COUNTS (gene_counts_pc.tsv) [ 0%] 0 of 1
[- ] process > PCA_SAMPLES -
[b0/02a06a] process > DIFF_EXPRESSION (gene_counts_pc.tsv) [ 0%] 0 of 1
ERROR ~ Error executing process > 'DIFF_EXPRESSION (gene_counts_pc.tsv)'

Caused by:
Process DIFF_EXPRESSION (gene_counts_pc.tsv) terminated with an error exit status (1)

Command executed:

[ ! -f contrast_table.tsv ] && ln -s contratst.txt contrast_table.tsv
diffexpr.R -p 0.05 -l 1 -o ./

Command exit status:
1

Command output:
(empty)

Thank you for any kind of help!

Differential expression error

Dear Adam

I'm hoping you can perhaps help with the following error. I previously ran the pipeline without a contrasts file and it worked perfectly. Now I'm trying to repeat the run with the contrasts file. It's a simple two-condition experiment. I have attached the command.err file for the step, as well as the sample sheet, contrasts file and my stdout file.

Running on an HPC with Ubuntu 20 installed.
The command I ran was within a Slurm launcher. I've had to pull the GitHub repository in a rather unusual way. The command after exporting Singularity temp paths is:

nextflow run https://github.com/adamd3/BactSeq -latest -r main --paired --data_dir /scratch/sysuser/jonathan/ops/MRC/DE_RNA --sample_file /scratch/sysuser/jonathan/ops/MRC/DE_RNA/samplesheet.txt --ref_genome /scratch/sysuser/jonathan/ops/MRC/DE_RNA/GCF_000195955.2_ASM19595v2_genomic.fna --ref_ann /scratch/sysuser/jonathan/ops/MRC/DE_RNA/genomic.gff --cont_tabl contrasts.tsv -profile singularity

I recognize group_INF_vs_EFF as a DESeq2 parameter, but I don't know what causes the error or what mistake I may have made.

Kind regards,
Jonathan
stdout.txt
command.txt
samplesheet.txt
contrasts.txt
