Code Monkey home page Code Monkey logo

scafe's Introduction

SCAFE (Single Cell Analysis of Five-prime Ends)

          5'-O~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AAA-3'
                       O~~~AA      O~~         O~       O~~~~~~~AO~~~~~~~~A
                     O~~    O~~ O~~   O~~     O~O~~     O~~      O~~       
                      O~~      O~~           O~  O~~    O~~      O~~       
                        O~~    O~~          O~~   O~~   O~~~~~AA O~~~~~~A  
                           O~~ O~~         O~~~~~A O~~  O~~      O~~       
                     O~~    O~~ O~~   O~~ O~~       O~~ O~~      O~~       
                       O~~~~A     O~~~   O~~         O~~O~~      O~~~~~~~AA
      ┌─ᐅ 5'-O~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-3'
...===┴========================================================================================...

SCAFE (Single Cell Analysis of Five-prime Ends) provides an end-to-end solution for processing of single cell 5’end RNA-seq data. It takes a read alignment file (*.bam) from single-cell RNA-5’end-sequencing (e.g. 10xGenomics Chromimum®), precisely maps the cDNA 5'ends (i.e. transcription start sites, TSS), filters for the artefacts and identifies genuine TSS clusters using logistic regression. Based on the TSS clusters, it defines transcribed cis-regulatory elements (tCRE) and annotated them to gene models. It then counts the UMI in tCRE in single cells and returns a tCRE UMI/cellbarcode matrix ready for downstream analyses, e.g. cell-type clustering, linking promoters to enhancers by co-activity etc.

Table of contents

Publications

Jonathan Moody and Tsukasa Kouno et al. Profiling of transcribed cis-regulatory elements in single cells. bioRxiv 2021.04.04.438388

Versions

v1.0.0 (Lastest Stable release) [June 6, 2022]

  • auto detects mode for TSO identification and trimming (in scafe.tool.sc.bam_to_ctss)
  • use tabix/bgzip for fast indexing of ctss bed
  • annotate hyperactive distal locus, analogous to super-enhancer (in scafe.tool.cm.annotate)
  • annotate potential alternative promoters beyond reference transcript 5'ends (in scafe.tool.cm.annotate)
  • aggregate multiple ctss without considering UMI/Cellbarcode, for scalability, replacing 'pool' tools and workflows (in scafe.tool.cm.aggregate and scafe.workflow.cm.aggregate)
  • scafe.tool.sc.link is now obsolete and thus monocle3 and cicero are not needed.
  • calculate the directionality of tCREs (in scafe.tool.cm.directionality)

v0.9.0-beta [March 20, 2021]

  • Initial pre-release
  • beta phase, likely buggy and has performance issues.

What does SCAFE do?

SCAFE extracts transcribed cis-regulatory elements from single-cell RNA-5’end-sequencing data

Profiling of cis-regulatory elements (CREs, mostly promoters and enhancers) in single cells allows us to interrogate the cell-type specific contexts of gene regulation and genetic predisposition to diseases. Single-cell RNA-5’end-sequencing methods (sc-end5-seq, available from 10xGenomics Chromimum®) theorectically captures the 5'end of cDNA, which represents transcription start sites (TSS). Measuring the RNA output at TSS allows us to precisely locate transcribed CREs (tCREs) on the genome, enabling the quantification of promoter and enhancer activities in single cells. Figure (a) shows the sc-end5-seq signal at the two promoters of gene DHX30. It highlights the consistency between sc-end5-seq and sc-ATAC-seq data, as well as the dynamic alternaitve TSS usage between cell states (i.e. resting and stimulated immune cells). SCAFE identify genuine TSS information from sc-end5-seq data. Figure (b) illustrates the stretagies of SCAFE to defined tCRE from TSS information. Figure (c) shows the proximal and distal tCREs defined by SCAFE at PTGER4 locus. Figure (d) shows the activities of the proximal and distal tCREs at PTGER4 locus in resting and simulated immune cells, demonstarting the capability of SCAFE to study CRE activities in single cells.

How does SCAFE do it?

SCAFE Core Tools and Workflows

SCAFE consists of a set of perl programs for processing of sc-end5-seq data. Major tools are listed in Figure (a). SCAFE accepts read alignment in *.bam format from 10xGenomics Chromimum® software cellranger. Tool bam_to_ctss extracts the 5’ position of reads, taking the 5’ unencoded-Gs into account. Tool remove_strand_invader removes read 5’ends that are strand invasion artifacts by aligning the TS oligo sequence to the immediate upstream sequence of the read 5’end. Tool cluster performs clustering of read 5’ends using 3rd-party tool Paraclu. Tool filter extracts the properties of TSS clusters and performs multiple logistic regression to distinguish genuine TSS clusters from artifacts. Tool annotate define tCREs by merging closely located TSS clusters and annotate tCREs based on their proximity to known genes. Tool count counts the number of UMI within each tCRE in single cells and generates a tCRE-Cell UMI count matrix. SCAFE tools were also implemented workflows for processing of individual samples or aggregateing of multiple samples.

P.S. SCAFE also accepts bulk 5'end RNA-Seq data (e.g. bulk CAGE). See scripts for details.

SCAFE discovers de novo genunie TSS clusters and tCREs

A fraction of TSS identified based on read 5′ends from template switching (TS) reactions (used in 10xGenomics Chromimum®) may not be genuine, attributed to various artefacts including strand invasion and other sources. This results in excessive artifactual TSS misidentified along the gene body, collectively known as “exon painting”. While strand invasion artefacts can be specifically minimized by considering the complementarity of TSS upstream sequence to TS oligo sequence, a non-negligible fraction of artefactual TSS remains after filtering for strand invasion. To minimize the artifactual TSS, SCAFE examines the properties of TSS clusters, as shown in Figure (b), and devised a classifier to identify genuine TSS based on multiple logistic regression. This classifier, i.e. logistic probability, achieved excellent performance with AUC>0.98 across sequencing depths and outperformed all individual metrics. This is implemented in the tool filter.

Dependencies

TL;DR

Go straight to the Docker Image section if you do not want to deal with dependencies and already have docker installed.

perl

SCAFE is mainly written in perl (v5.24.1 or later). All scripts are standalone applications and DO NOT require installations of extra perl modules. Check whether perl is properly installed on your system.

#--- Check your perl version
perl --version

R

SCAFE relies on R for logistic regression, ROC analysis and graph plotting. Rscript (v3.6.1 or later) and the following R packages have to be properly installed:

#--- Check your Rscript version, must be 3.6.1 ot later
Rscript --version

#--- Check your R packages, install if missing
Rscript -e 'install.packages("ROCR")'
Rscript -e 'install.packages("PRROC")'
Rscript -e 'install.packages("caret")'
Rscript -e 'install.packages("e1071")'
Rscript -e 'install.packages("ggplot2")'
Rscript -e 'install.packages("scales")'
Rscript -e 'install.packages("reshape2")'
Rscript -e 'install.packages("R.utils")'

Note: scafe.tool.sc.link is now obsolete since v1.0.0 now and thus monocle3 and cicero are not needed.

Other 3rd party applications

SCAFE also relies on a number of 3rd party applications. The binaries and executables (Linux) of these applications are distributed with this reprository in the directory ./resources/bin and DO NOT require installations.

OS

SCAFE was developed and tested on Debian GNU/Linux 9, with R (3.6.1) and perl (5.24.1) installed. Running SACFE on other OS with other version of R and perl are not guranteed. In you want to run SCAFE on other OS, we would recommend running it from docker container, see below. If you would like to run SCAFE natively on other OS, you have to ensure the R and perl versions, and might consider downloading and compiling the other 3rd party applications from their own sources. The binaries of 3rd party applications have to be execuatble in SCAFE directoty at ./resources/bin/.

Installation

Once you ensured the above dependencies are met, you are ready to download SCAFE to your system.

Clone this respository

#--- make a directory to install SCAFE
mkdir -pm 755 /my/path/to/install/
cd /my/path/to/install/

#--- Obtain SCAFE from github
git clone https://github.com/chung-lab/SCAFE
cd SCAFE

#--- export SCAFE scripts dir to PATH for system-wide call of SCAFE commands 
echo "export PATH=\$PATH:$(pwd)/scripts" >>~/.bashrc
source ~/.bashrc

#--- making sure the scripts and binaries are executable
chmod 755 -R ./scripts/
chmod 755 -R ./resources/bin/

Check the dependencies

To ensure the dependencies, please run check.dependencies.

#--- run check.dependencies to check 
scafe.check.dependencies

Docker image

If you have docker installed on your system, you might also consider pulling the SCAFE docker image and run it in a docker container. Once you are logged into the SCAFE docker container, the following tutorial on the demo data can be ran with exactly the same command.

To install docker, please see here. Noted that all files reads/writes are within the docker container by default. To share files (i.e. input and output of SCAFE) between the container and the host, please see here. If you are running Docker on a labtop, make sure the allocated resources (e.g. memory and disk space) are enough, see here. We suggest to allocate at least 16GB of memory.

#---to pull the docker image
docker pull cchon/scafe:latest

#---to run scafe within a docker container, run
docker run -it cchon/scafe:latest

The Docker image can also be used with Singularity container platform, with additional options specified as the followings.

#---to pull the docker image using singularity
singularity pull docker://cchon/scafe:latest

#---to run scafe within a singularity container, run
singularity shell --writable --env APPEND_PATH=/SCAFE/scripts --env LC_ALL=C docker://cchon/scafe:latest

Getting started with demo data

Now you have enssured all dependencies and downloaded SCAFE, time to get the demo data and test a few runs on the demo data.

Download demo data and reference genome

Demo data and reference genome must be downloaded for testing SCAFE on your system.

#--- download the demo data using script download.demo.input
scafe.download.demo.input
	
#--- download the reference genome hg19.gencode_v32lift37 for testing demo data
scafe.download.resources.genome --genome=hg19.gencode_v32lift37

Test run a single cell solo workflow for demo data

Now, let's test SCAFE with a workflow (workflow.sc.solo) that processes one library of single cell 5'end RNA-seq data. First we check out the help message of workflow.sc.solo. Remember you can always check the help message for all scripts of SCAFE using the --help flag.

#--- check out the help message of workflow.sc.solo
scafe.workflow.sc.solo --help

It should print the help message as the followings:

Usage:
               5'-O~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AAA-3'
                            O~~~AA      O~~         O~       O~~~~~~~AO~~~~~~~~A
                          O~~    O~~ O~~   O~~     O~O~~     O~~      O~~       
                           O~~      O~~           O~  O~~    O~~      O~~       
                             O~~    O~~          O~~   O~~   O~~~~~AA O~~~~~~A  
                                O~~ O~~         O~~~~~A O~~  O~~      O~~       
                          O~~    O~~ O~~   O~~ O~~       O~~ O~~      O~~       
                            O~~~~A     O~~~   O~~         O~~O~~      O~~~~~~~AA
           ┌─ᐅ 5'-O~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-3'
     ...===┴========================================================================================...

                          Single Cell Analysis of Five-prime End (SCAFE) Tool Suite 
                                    ---> scafe.workflow.sc.solo <---
                      <--- workflow, single-cell mode, process a single sample --->

     Description:
       This workflow process a single sample, from a cellranger bam file to tCRE UMI/cellbarcode count matrix

     Usage:
       scafe.workflow.sc.solo [options] --run_bam_path --run_cellbarcode_path --genome --run_tag --run_outDir
   
       --run_bam_path                  <required> [string] bam file from cellranger, can be read 1 only or pair-end
       --run_cellbarcode_path          <required> [string] tsv file contains a list of cell barcodes,
                                                           barcodes.tsv.gz from cellranger
       --genome                        <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
       --run_tag                       <required> [string] prefix for the output files
       --run_outDir                    <required> [string] directory for the output files
       --training_signal_path          (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                                           regions (e.g. annotated CRE, in bed format) used for training of logical 
                                                           regression model If null, $usr_glm_model_path must be supplied for 
                                                           pre-built logical regression model. It overrides usr_glm_model_path 
                                                           (default=null)
       --testing_signal_path           (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                                           regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                                           of the logical regression model. If null, annotated TSS from $genome will be 
                                                           used as binary genomic regions. (default=null)
       --detect_TS_oligo (optional) [match|trim|skip|auto] in bam_to_ctss step, the modes of detecting TS oligo. 1. match: search for 
                                                           TS oligo sequence on the read, identify the TSO/cDNA junction as 5'end of 
                                                           the read. This works only when the error rate of the TS oligo region on 
                                                           the read is low, otherwise a considerable number of read will be invalid. 
                                                           2. trim: assuming the 1st N bases of the reads are TS oligo, without 
                                                           checking the actual sequence. N is determined by the length of TS oligo. 
                                                           3. skip: assuming the TS oligo was not sequenced, the 1st base of the read
                                                           will be treated as the 1st base after the TS oligo. 4. auto: automatically 
                                                           determines the best mode, best of the observed error rate of the TS oligo
                                                           and the frequency of 5'end softclipped bases by the aligner. If softcliped 
                                                           bases is close to the length of TS oligo, mode 1 or 2 will be chosen, 
                                                           depending on the observed error rate of the TS oligo (error rate <= 0.1, 
                                                           mode 1 will be chosen or mode 2 otherwise). If softcliped base os close to 
                                                           zero, mode 3 will be chosen. (default=auto).
       --max_thread                   (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                                           avoid memory overflow (default=5)
       --overwrite                     (optional) [yes/no] erase run_outDir before running (default=no)

     Dependencies:
       R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
       bigWigAverageOverBed
       bedGraphToBigWig
       bedtools
       samtools
       paraclu
       paraclu-cut.sh
       tabix
       bgzip

     For demo, cd to SCAFE dir and run,
       scafe.workflow.sc.solo \
       --overwrite=yes \
       --run_bam_path=./demo/input/sc.solo/demo.cellranger.bam \
       --run_cellbarcode_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
       --genome=hg19.gencode_v32lift37 \
       --run_tag=demo \
       --run_outDir=./demo/output/sc.solo/

The help message details the input options, noted some are <required> and (optional). workflow.sc.solo takes a bam file (--run_bam_path=), a cellbarcode list file (--run_cellbarcode_path=), the corresponding reference genome (--genome=) and the output prefix (--run_tag=) and directory (--run_outDir=). At the end of the message, it also prints the commands for running the script on demo data. Now, copy the command and run as the following:

#--- run the workflow on the demo.cellranger.bam, it'll take a couple of minutes 
scafe.workflow.sc.solo \
--overwrite=yes \
--run_bam_path=./demo/input/sc.solo/demo.cellranger.bam \
--run_cellbarcode_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/sc.solo/

If it finishes smoothly, you should see the following message on screen:

 #=====================================================
 Results of all tasks found. All tasks run successfully
 #=====================================================

You can also check out some of the outputs:

#--- check the output of tCRE annotation tables
ls -alh ./demo/output/sc.solo/annotate/demo/log

#--- check the output of tCRE UMI count matrix
ls -alh ./demo/output/sc.solo/count/demo/matrix

Test run all workflows on bulk and single cell data

We recommended a test run for all 5 available workflows on the demo data. It will take around ~20 minutes on a regular system running at 5 threads (default).

#--- check out the help message of demo.test.run
scafe.demo.test.run --help

#--- run the all six available workflows on the demo bulk and single
#--- it'll take a around 20 minutes
scafe.demo.test.run \
--overwrite=yes \
--run_outDir=./demo/output/

Run SCAFE on your own data

Input *.bam files

SCAFE maps the cDNA 5'end by identifying the junction between the TS oligo and the cDNA on Read 1 of sc-end5-seq data on the 10xGenomics Chromimum® platform. Therefore, Read 1 must be sequenced long enough (e.g. >50nt) to allow mappnig to genome. The *.bam files are commonly generated from 10xGenomics Chromimum® cellranger count pipeline. When running cellranger count, the --chemistry option must be SC5P-PE (Pair-end). The *.bam generated from both --chemistry fiveprime and --chemistry SC5P-R2 options are NOT COMPATIBLE with SCAFE as the former will remove the junction between the TS oligo and the cDNA on Read 1 and the latter does not even contain Read 1. If users sequecnced Read 1 only, they could run cellranger count with --chemistry SC5P-PE option by supplying cellranger a dummy Read 2 fastq with Read 1 reverse complemented. If users wish to generate their own *.bam from other custom pipelines, make sure that:

  1. the UMI and cellbarcode information must be present in the *.bam file as custom tags CB:Z and UB:Z respectively, in this order.
  2. TS oligo sequence must keep intact on Read 1.

Run SCAFE workflows on single cell data with default options

We recommend most users to run SCAFE using workflows with default options. There are 3 types of workflow: (1) "solo" for processing of a single library, (2) "aggregate" for aggregating of multiple libraries and (3) "subsample" for down-sampling a single library (for assessment of sequencing depth). "solo" accepts *.bam while "aggregate"/"subsample" accepts *.ctss.bed files (generated from tool.sc.bam_to_ctss). For multiple libraries, we recommend users to first run either workflow.sc.solo or tool.sc.bam_to_ctss on indiviudal libraries, and then take the *.ctss.bed file from all libraries to run workflow.cm.aggregate. Aggregating of libraries for defining tCRE is recommended because (1) it generally increases the sensitivity of tCRE detection and (2) it produces a common set of tCREs for all libraries so the IDs are portable between libraries. Please check the help messages for details of running the workflows:

#--- check out the help message of the three single cell workflows
scafe.workflow.sc.solo --help
scafe.workflow.cm.aggregate --help
scafe.workflow.sc.subsample --help

Run SCAFE individual tools with custom options

For the sake of flexibiity, SCAFE allows users to run individual tools with custom options for exploring the effect of cutoffs or supplying alternative intermediate inputs. See here for the full list of tools and their usage. Let's walk through a couple of individual tools with the demo data.

  • tool.sc.bam_to_ctss: First, we convert a cellranger *.bam file to *.ctss.bed files. "ctss" refers as capped TSS, and *.ctss.bed file is a common format for storing TSS information. For procedural convenenice, tool.sc.bam_to_ctss generates multiple *.ctss.bed files at various levels of collapsing the signal (e.g. piling up UMI at TSS or not, summing up UMI of different cellbarcode or not). By default, the tool.sc.bam_to_ctss process ONLY primary alignments regardless of MAPQ. If users wants to, for example, include also secondary aligments with a minimum MAPQ (e.g. 10), the user could run as the followings:
#--- check out the help message of tool.sc.bam_to_ctss
scafe.tool.sc.bam_to_ctss --help

#--- run tool.sc.bam_to_ctss with custom options
scafe.tool.sc.bam_to_ctss \
--min_MAPQ 10 \
--exclude_flag '128,4' \
--overwrite=yes \
--bamPath=./demo/input/sc.solo/demo.cellranger.bam \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/bam_to_ctss/
  • tool.cm.remove_strand_invader: Then, we removes the strand invader artefacts from the *.ctss.bed file generated from tool.sc.bam_to_ctss. Please refer to here for the rationale of removing strand invader artefacts. If the users would like to use a more stringent cutoff to define strand invader artefacts, e.g. --min_edit_distance=3 and --min_end_non_G_num=1, so that less reads will be removed, and the user could run as the followings:
#--- check out the help message of tool.cm.remove_strand_invader
scafe.tool.cm.remove_strand_invader --help

#--- run tool.cm.remove_strand_invader with custom options
scafe.tool.cm.remove_strand_invader \
--min_edit_distance=3 \
--min_end_non_G_num=1 \
--overwrite=yes \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/remove_strand_invader/
  • tool.cm.cluster: Then, we cluster the *.ctss.bed file into TSS clusters. Please refer to here for the rationale of clustering. By default, the clusters with <5 UMI within cluster, <3 UMI at summit or expressed in <3 cells were removed. If the users would like to use a more stringent cutoff to remove more lowly expressed TSS clusters, e.g. --min_cluster_count=10, --min_summit_count=5 and --min_num_sample_expr_cluster=5,the user could run as the followings:
#--- check out the help message of tool.cm.cluster
scafe.tool.cm.cluster --help

#--- run tool.cm.cluster with custom options
scafe.tool.cm.cluster \
--min_cluster_count=10 \
--min_summit_count=5 \
--min_num_sample_expr_cluster=5 \
--overwrite=yes \
--cluster_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/cluster/
  • tool.cm.filter: Then, we filter the TSS cluster using logistic regression. By default, tool.cm.filter uses a pre-trained multiple logistic regression model from human iPSC cells using matched ATAC-Seq data. If the users would like to use their own matched ATAC-Seq data (–logP as *.bigwig file) for training of the regression model using --training_signal_path and --testing_signal_path options. Also, the user can set a permissive logistic probablity threshold (default=0.5) using --default_cutoff option (e.g. 0.3). The user could run as the followings:
#--- check out the help message of tool.cm.filter
scafe.tool.cm.filter --help

#--- run tool.cm.cluster with custom options
scafe.tool.cm.filter \
--default_cutoff=0.3 \
--overwrite=yes \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--ung_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.unencoded_G.collapse.ctss.bed.gz \
--tssCluster_bed_path=./demo/output/sc.solo/cluster/demo/bed/demo.tssCluster.bed.gz \
--training_signal_path=./demo/input/atac/demo.atac.bw \
--testing_signal_path=./demo/input/atac/demo.atac.bw \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/filter/
  • tool.cm.annotate: Finally, we will define and annotate tCREs based on the gene models in reference genome. By default, tool.cm.annotate merge TSS clusters located within +100nt and –400nt as a tCRE (defined by combination of option --CRE_extend_size and --CRE_extend_upstrm_ratio). Also, the tCREs with +/-500nt of annotated gene TSS will be assigned as proximal tCRE (defined by option --proximity_slop_rng). If the users would like to define tCRE by merging more distant TSS clusters (e.g. +200nt and –600nt) and assign tCRE further away (+/-1000nt) from gene TSS as proximal, the user could run as the followings:
#--- check out the help message of tool.cm.annotate
scafe.tool.cm.annotate --help

#--- run tool.cm.annotate with custom options
scafe.tool.cm.annotate \
--CRE_extend_size=800 \
--CRE_extend_upstrm_ratio=3 \
--proximity_slop_rng=1000 \
--overwrite=yes \
--tssCluster_bed_path=./demo/output/sc.solo/filter/demo/bed/demo.tssCluster.default.filtered.bed.gz \
--tssCluster_info_path=./demo/output/sc.solo/filter/demo/log/demo.tssCluster.log.tsv \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/annotate/

Making a custom reference genome

Currently, four reference genomes ara available. See ./scripts/scafe.download.resources.genome for downloading. Alternatively, some users might work on genomes of other organisms, or prefer to use custom gene models for annotating tCREs. tool.cm.prep_genome converts user-supplied genome *.fasta and gene model *.gtf into necessary files for SCAFE. You can check out the help message for inputs of tool.cm.prep_genome and then test run a demo using TAIR10 genome with AtRTDv2 gene model.

#--- check out the help message of tool.cm.prep_genome
scafe.tool.cm.prep_genome --help

#--- run the tool on the TAIR10 assembly with gene model AtRTDv2 
scafe.tool.cm.prep_genome \
--overwrite=yes \
--gtf_path=./demo/input/genome/TAIR10.AtRTDv2.gtf.gz \
--fasta_path=./demo/input/genome/TAIR10.genome.fa.gz \
--chrom_list_path=./demo/input/genome/TAIR10.chrom_list.txt \
--mask_bed_path=./demo/input/genome/TAIR10.ATAC.bed.gz \
--outputPrefix=TAIR10.AtRTDv2 \
--outDir=./demo/output/genome/

Run SCAFE with bulk CAGE data

SCAFE also accepts .*bam files from bulk CAGE. The major difference between singel cell and bulk workflow is cellbarcode is not considered. Otherwise, the options between single cell and bulk workflow are large the same. Please check the help messages for details of running the workflows:

#--- check out the help message of the three single cell workflows
scafe.workflow.bk.solo --help
scafe.workflow.cm.aggregate --help
scafe.workflow.bk.subsample --help

Wishlist

  • bam_to_ctss error check for various problems in bam file, e.g. bams with read 2 only
  • enable the use of system-wide bins for common 3rd party software (e.g. bedtools samtools)
  • update genome resources with updated genocode versions

scafe's People

Contributors

chung-lab avatar jenchien avatar tomkellygenetics avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

scafe's Issues

Two very nearby clusters don't get merged, confusion ensues in `directionality`

I took a look at the directionality output, and spotted a CRE (chr10_100185675_100186176_+) which had ~200 negative strand reads and ~25 positive strand reads. Intrigued, I delved into the various intermediate files to investigate. Apparently two filtered clusters (chr10_100185889_100186059_- and chr10_100186073_100186094_+) fall within the annotated CRE's range, but rather than get merged they translate to separate, overlapping CREs (the other one becomes chr10_100185928_100186429_-).

My best guess as to why the two weren't merged is that all the .-oriented CREs in the output have a typeStr of unanno_tss and a gene_promoter of none or unannotated. Meanwhile chr10_100185928_100186429_- is gene_tss and annotated respectively. chr10_100185675_100186176_+ is unanno_tss and none, but I presume the strandedness conflict is keeping it unmerged. I see instances of gene_tsses and unanno_tsses merged, but the standedness agrees.

Nevertheless, the end result is a pair of overlapping CREs that then cause confusion in directionality. Would it make sense to somehow refine annotate coordinate extension to avoid these sort of situations from occurring? Or does the solution lie elsewhere?

Some other musings that may be useful to whoever's reading:

  • I had a ton of CREs with very small scores as a result of the default directionality performed in the 10X workflow. Passing the filtered cluster file as --ctss_scope_bed_path made unidirectional a lot more prevalent. This obviously did nothing in the case discussed above, but still felt worth noting.
  • The count function explicitly uses the +/-/. strandedness stored within the name (and BED) of the CRE. This whole thing was brought on by me thinking if there's a way to propagate the directionality information to a per-cell level, filtering the CB.ctss.bed to +/- respectively, running individual count calls, and then finding the overlap in features captured to be the . CREs exclusively. As is, count and directionality tell you different things.

using 10X references

Hello,

I'm interested in using the tool, and would ideally be able to use the same reference genome as what was used for the cellranger mapping (e.g. GRCh38-2020-A). However, it seems that only the GTF and fasta are present as part of the reference. The chromosome list should be trivial to generate, but how would I go about obtaining the BED? A cursory googling of ENCODE CRE did not turn up anything promising.

Thanks a lot and sorry for the trouble.

scafe.workflow.sc.solo dies at line 551

Hi, I successfully installed SCAFE and now I want to run scafe.workflow.sc.solo with my own data using the default options. However, I get following error message and was wondering why the script died? is it missing some files?

Thanks a lot in advance.

Screen Shot 2021-04-05 at 12 49 52

Results of count not found. Quitting

Hello Dr. Chung, I am really really sorry. But I cannot fix it. I still run the scafe.workflow.sc.pool
My lib is like this:
S17765-F11_S33 33 ./S33_SCAFE_out/bam_to_ctss/S17765-F11_S33/bed/S17765-F11_S33.UMI_CB.ctss.bed.gz ./S17765-F11_S33_out/outs/filtered_feature_bc_matrix/S17765-F11_S33.barcodes.tsv.gz ./S33_SCAFE_out/bam_to_ctss/S17765-F11_S33/bed/S17765-F11_S33.CB.ctss.bed.gz S17765-F11_S34 34 ./S34_SCAFE_out/bam_to_ctss/S17765-F11_S34/bed/S17765-F11_S34.UMI_CB.ctss.bed.gz ./S17765-F11_S34_out/outs/filtered_feature_bc_matrix/S17765-F11_S34.barcodes.tsv.gz ./S34_SCAFE_out/bam_to_ctss/S17765-F11_S34/bed/S17765-F11_S34.CB.ctss.bed.gz S17765-F11_S35 35 ./S35_SCAFE_out/bam_to_ctss/S17765-F11_S35/bed/S17765-F11_S35.UMI_CB.ctss.bed.gz ./S17765-F11_S35_out/outs/filtered_feature_bc_matrix/S17765-F11_S35.barcodes.tsv.gz ./S35_SCAFE_out/bam_to_ctss/S17765-F11_S35/bed/S17765-F11_S35.CB.ctss.bed.gz S17765-F11_S36 36 ./S36_SCAFE_out/bam_to_ctss/S17765-F11_S36/bed/S17765-F11_S36.UMI_CB.ctss.bed.gz ./S17765-F11_S36_out/outs/filtered_feature_bc_matrix/S17765-F11_S36.barcodes.tsv.gz ./S36_SCAFE_out/bam_to_ctss/S17765-F11_S36/bed/S17765-F11_S36.CB.ctss.bed.gz

My code is
nohup scafe.workflow.sc.pool --overwrite=yes --lib_list_path=./lib_list_path.txt --genome=hg38.gencode_v32 --run_tag=demo --run_outDir=./sc.pool &

However it gives me this error.
image

This time, I already check the lib_list_path clearly. However, it still throw this error. Hope to get your help. Thank you!

demo output of calculateDirectionality reveals mis-classified directionality?

the subroutine calculateDirectionality adds a 'score' to the output file:

SCAFE/demo/output/sc.solo/directionality/demo/bed/demo.CRE.directionality.summit_center.bed.gz

According to the sub, we have that

  1. $strand_diff = abs($fwd_count-$rev_count);
  2. the score column of the above bed file contains $directionality = 1-($strand_diff/$strand_sum);

In other words, if $fwd_count==$rev_countthen score=1 in the bed file.

Running the demo, there is exactly 1 locus with bi-directional transcription (strand==".") and about 30 loci where the score column of the bed file is > 0.9 denoting a strand count imbalance of less than 10%. Where 5 of these loci surprisingly have a score of 1, which should mean completely balanced fwd and reverse strand counts. Such transcriptional activity seems intuitively bidirectional:

image

dropbox access failed

Dear SCAFE team,

Check your dropbox files, please.
It seems that the files that need to be downloaded through the scafe.download.demo.input command are not available.

How are gene promoter decisions made for unannotated promoters?

Great tool :) I'm just wondering about the gene promoter annotation.

In CRE.info.tsv the gene_promoter value is either annotated, unannotated or none. I understand that gene_promoter = annotated applies to CREs that are both proximal and on the same strand (ss) as a matching transcript TSS. And all CREs where gene_promoter = unannotated are on the same strand (ss) and have regionType listed as either "exon" or "intron".

But I also have many thousands of CREs with ss and regionType = exon | intron that are simply listed as gene promoter = "none".

So is there an extra piece of information deciding whether to call a CRE as an unannotated promoter? And if so what is that?

pooling of multiple samples

Hi,

I am curious about how the scafe.workflow.sc.pool is functioning to merge all the tCREs across samples. As mentioned in README, I ran workflow.sc.solo on an individual sample first and used scafe.workflow.sc.pool to pool all the 64 samples together. In the single sample run, I got ~15,000 tCREs on average. But after pooling only ~9,800 was returned as in common. I am thinking if the merging criteria are too strict for my dataset.

I didn't find a description of how scafe.workflow.sc.pool is doing the pool step. It would be great if you can help to explain.

Thanks!

Bedtools issue

Hey,

I've installed all dependencies successfully, however SCAFE can't seem to find bedtools. Below is my attempt at running the demo data alongside proof that bedtools is both in the ./resources/bin folder and in my path (note: scafe.check.dependencies also can't find bedtools, I've also tried removing bedtools from my path in case there is a clash but same error occurs).

[regmgbe@login13 SCAFE]$ scafe.workflow.sc.solo --overwrite=yes --run_bam_path=./demo/input/sc.solo/demo.cellranger.bam --run_cellbarcode_path=./demo/input/sc.solo/demo.barcodes.tsv.gz --genome=hg19.gencode_v32lift37 --run_tag=demo --run_outDir=./demo/output/sc.solo/

#===========================#
Start running bam_to_ctss
#===========================#


=========================================================================
[2021-04-22 09:57] starts running ...... 
=========================================================================

[2021-04-22 09:57] Checking all SCAFE executables          
bedtools is not installed properly. Quitting.

#========================================#
Results of bam_to_ctss not found. Quitting.
#========================================#
Died at /home/regmgbe/Programmes/SCAFE/scripts/scafe.workflow.sc.solo line 551.
[regmgbe@login13 SCAFE]$ bedtools --version
bedtools v2.25.0
[regmgbe@login13 SCAFE]$ ls -l resources/bin/bedtools/
total 36516
-rwxrwxrwx 1 regmgbe regmx0 37387240 Apr 21 14:37 bedtools

Any help would be greatly appreciated!

failed at running scafe.tool.sc.bam_to_ctss

Hello, thank you for developing this tool. I met a problem when i use it.
there is my script:
scafe.tool.sc.bam_to_ctss
--min_MAPQ 10
--exclude_flag '128,4'
--overwrite=yes
--bamPath=./possorted_genome_bam.bam
--genome=mm10.gencode_vM25
--outputPrefix=06_vm25_test
--outDir=./06_vm25_test/output/sc.solo/bam_to_ctss/

there is output :
[2023-12-18 13:07] Checking all SCAFE executables
[2023-12-18 13:07] Checking: tabix version: 1.9
[2023-12-18 13:07] Checking: bgzip version: 1.9
[2023-12-18 13:07] Checking: bedtools version: 2.30.0
[2023-12-18 13:07] Checking: samtools version: Version: 1.3.1
[2023-12-18 13:07] Checking: paraclu found.
[2023-12-18 13:07] Checking: paraclu-cut found.
[2023-12-18 13:07] Checking: bedGraphToBigWig version: 2.8
[2023-12-18 13:07] Checking: bigWigAverageOverBed version: 2
[2023-12-18 13:07] Checking TSO presence in bam
[2023-12-18 13:07] Determining subsampe fraction
Illegal division by zero at /SCAFE/scripts/scafe.tool.sc.bam_to_ctss line 1406.

I see here are some similar questions, but no answer. How can i solve it?

Fail to build with singularity

When trying to install SCAFE with singularity as it's in README.md I run into the error that "singularity shell --writable" is working only for singularity sandboxes, but the docker://cchon/scafe:latest is an image. Please correct this.
image

genome does not have chrom_name_path

Hello, thank you for developing this awesome tool. I meet trouble when I use it.
When I use the scafe.workflow.sc.solo, it always tell me that genome does not have chrom_name_path.
My code is looking like this.
scafe.workflow.sc.solo --run_bam_path ./S17765-F11_S33_out/outs/possorted_genome_bam.bam --run_cellbarcode_path ./S17765-F11_S34_out/outs/filtered_feature_bc_matrix --genome /storage/yhhuang/users/ruiyan/software/SCAFE/resources/genome/my_hg38_encode --run_tag S17765-F11_S33 --run_outDir ./S33_SCAFE_out

However, I got the error.
image

Hope to get your help. I found that some people have the same questions. But he solved it by re-downloading the reference genome. However, I did it according to his analysis. I found that it is not helpful to me. On the one hand, I download the resources.genome. ./scafe.download.resources.genome --genome hg38.gencode_v32 On the other hand, I run the scafe.tool.cm.prep_genome. All of them do not work for me.

Thank you in advance.

Illegal division by zero at scafe.tool.sc.bam_to_ctss

Hello,

I am facing a very similar error to #20:

Illegal division by zero at ./SCAFE/scripts/scafe.tool.sc.bam_to_ctss line 1406.
Died at ./SCAFE/scripts/scafe.workflow.sc.solo line 595.

I have checked I do have both R1 and R2 in my BAM file and the sequences are generally 95nt long. Demo tests ran successfully.

I have attached the first few lines of my BAM file. Any advice on what might be going wrong would be much appreciated.

Many thanks,
Ximena

head.txt

bam_to_ctss throws error with demo data

I downloaded the genome reference using "scafe.download.resources.genome --genome=hg19.gencode_v32lift37", but get an error when I try to run the demo data with scafe.workflow.sc.solo. I reran prep_genome, as instructed, but the error remains.

#===========================#
Start running bam_to_ctss
#===========================#


=========================================================================
[2021-04-29 17:27] starts running ...... 
=========================================================================

[2021-04-29 17:27] Checking all SCAFE executables          
[2021-04-29 17:27] Checking: bedtools version: 2.30.0
[2021-04-29 17:27] Checking: samtools version: Version: 1.3.1
[2021-04-29 17:27] Checking: paraclu found.
[2021-04-29 17:27] Checking: paraclu-cut found.
[2021-04-29 17:27] Checking: bedGraphToBigWig version: 2.8
[2021-04-29 17:27] Checking: bigWigAverageOverBed version: 2
genome hg19.gencode_v32lift37 does not have chrom_name_path. Please rerun prep_genome step

The error when running the scafe.workflow.sc.pool

Hello, I meet another trouble. Hope you can help me. Thank you in advance.
I want to combine three samples into one pool. Firstly, I create my own lib_list_path.txt. It looks like this.

S17765-F11_S33 1 /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S33_SCAFE_out/bam_to_ctss/S17765-F11_S33/bed/S17765-F11_S33.UMI_CB.ctss.bed.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S17765-F11_S33_out/outs/filtered_feature_bc_matrix/barcodes.tsv.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S33_SCAFE_out/bam_to_ctss/S17765-F11_S33/bed/S17765-F11_S33.CB.ctss.bed.gz
S17765-F11_S34 2 /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S34_SCAFE_out/bam_to_ctss/S17765-F11_S34/bed/S17765-F11_S34.UMI_CB.ctss.bed.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S17765-F11_S34_out/outs/filtered_feature_bc_matrix/barcodes.tsv.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S34_SCAFE_out/bam_to_ctss/S17765-F11_S34/bed/S17765-F11_S34.CB.ctss.bed.gz
S17765-F11_S36 3 /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S36_SCAFE_out/bam_to_ctss/S17765-F11_S36/bed/S17765-F11_S36.UMI_CB.ctss.bed.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S36_SCAFE_out/bam_to_ctss/S17765-F11_S36/bed/S17765-F11_S36.CB.ctss.bed.gz /storage/yhhuang/users/ruiyan/MS_Cambridge/S17765-F11/S17765-F11_S36_out/outs/filtered_feature_bc_matrix/barcodes.tsv.gz

Then I run the script as this
scafe.workflow.sc.pool --overwrite=yes --lib_list_path=./lib_list_path.txt --genome=hg38.gencode_v32 --run_tag=S17765-F11 --run_outDir=./sc.pool

However, it produces an error like this
image

But I have the count file folder in every sample.
image
I do not what is the origin of question. Hope to get your help. Thank you!

Error in scafe.tool.sc.bam_to_ctss "Use of uninitialized value in division"

Hi,

I've successfully run the demo data and all the software appears to e working. When I ran one of my own samples, however, sc.solo errored out with the following:

Use of uninitialized value in division (/) at /storage/goodell/home/jmreyes/scafe/SCAFE/scripts/scafe.tool.sc.bam_to_ctss line 620.
Illegal division by zero at /storage/goodell/home/jmreyes/scafe/SCAFE/scripts/scafe.tool.sc.bam_to_ctss line 620.
Died at /storage/goodell/home/jmreyes/scafe/SCAFE/scripts/scafe.workflow.sc.solo line 595.

Is there a way to check what may be causing the issue? The bam file was aligned using 'cellranger count' with the SC5P-PE chemistry to mm10.

Thank you,
Jaime

About "mouse ENCODE CREs"

Hi,

I would like to generate reference using "scafe.tool.cm.prep_genome". Could you please tell me where to download mouse ENCODE CREs.

Many thanks in advance!

Jiawei

non-10x compatibility

Hi,

I have a question regarding potential compatibility with non-10x platforms, but still 5' end sequencing. In particular, I have 5' end reads in a sorted .bam file that look like barcode-UMI-GGG-5' end (22 bp - 8 bp - 3 bp - 73 bp for a total of 106 bp).

Will it be possible to run SCAFE in this case? Is there a way to bypass the scafe.tool.cm.remove_strand_invader step in performing scafe.workflow.sc.solo? In addition, what changes might I need to make in scafe.tool.sc.bam_to_ctss in order to correctly identify the 5' end?

Thanks and much appreciated.

scafe.workflow.sc.solo Quitting: Auto detection of TS oligo failed. Average softclip length is ambiguous.

Dear SCAFE team,

I ran into a very unusual error.
For one sample, <scafe.workflow.sc.solo> fails with an error:

#[2023-05-23 07:11] TSO error_rate=0.14030
#Quitting: Auto detection of TS oligo failed. Average softclip length is ambiguous.

#========================================#
#Results of bam_to_ctss not found. Quitting.
#========================================#
#Died at /SCAFE/scripts/scafe.workflow.sc.solo line 595.

Could you suggest how to solve this problem?
cellranger with fastq of this sample works correctly and all downstream analysis of the gene count matrix of this sample does not cause problems.
I run <scafe.workflow.sc.solo> on 18 samples that were sequenced under the same conditions and only one of them gives this error.

Many Thanks,
Marat

failed to run scRNA-seq demo data

Hi,
I just installed this SCAFE to our sever,but I can't get results of the sc.solo/clsuster directory when I run the "scafe.demo.test.run". At the same time I ran the my data that download from the SRA datadase, I also did not get the cluster results. I don't know why, and I have check all the dependencies of this software, I found ererything is ok, so I can't find any error on this. Whould you like to help me to solve this?
This is cluster results,and it is null file.
image
This is the output on my screen.
image
the command is as below:
cd /opt/SCAFE
scafe.demo.test.run --overwrite=yes --run_outDir=./demo/output/

Died scafe.workflow.sc.solo line 556

Dear SCAFE team,

I want to run scafe.workflow.sc.solo with my own data (rat's 5'-scRNA-seq) using the default options.
I successfully created own genome reference package by scafe.tool.cm.prep_genome and run scafe.workflow.sc.solo.
bam_to_ctss, remove_strand_invader, cluster steps run successfully, but I get follow error when filter step starting:

Use of uninitialized value $stdOut in pattern match (m//) at /home/mssabirov/mstool/SCAFE/SCAFE/scripts/scafe.tool.cm.filter line 693.
Rscript is not installed properly. Quitting.

#========================================#
Results of filter not found. Quitting.
#========================================#
Died at /home/mssabirov/mstool/SCAFE/SCAFE/scripts/scafe.workflow.sc.solo line 556.

When I try to run scafe.tool.cm.filter separately, I get same error:
Use of uninitialized value $stdOut in pattern match (m//) at /home/mssabirov/mstool/SCAFE/SCAFE/scripts/scafe.tool.cm.filter line 693.
Rscript is not installed properly. Quitting.

I decided to check proper installation of all dependencies by running scafe.check.dependencies and the following errors were found:
error

I installed SCAFE via singularity.
Would you mind explaining me what I have to do for proper installation of all dependencies?
Thank you in advance.

About scafe.workflow.cm.aggregate

Hi,

  1. When I use "scafe.workflow.cm.aggregate" of new version (v1.0.0), I don't know the format of lib_list_path. Like this?
SRR12443300<\t>SRR12443300.collapse.ctss.bed.gz<\t>SRR12443300.unencoded_G.collapse.ctss.bed.gz
SRR12443301<\t>SRR12443301.collapse.ctss.bed.gz<\t>SRR12443301.unencoded_G.collapse.ctss.bed.gz
SRR12443302<\t>SRR12443302.collapse.ctss.bed.gz<\t>SRR12443302.unencoded_G.collapse.ctss.bed.gz
  1. Which file should I use in the result of scafe.workflow.cm.aggregate? I can't find any file that link each cCREs ID of each library together.

  2. Another problem in scafe.tool.cm.prep_genome. The downloaded reference contains glm directory but self-generating reference doesn't have it. Cound you please tell me how to generate the glm model?

Thanks!

Failed at scafe.workflow.cm.aggregate line 536

Hi,
I am trying to aggregate 3 independent libraries from the outputs of scafe.workflow.sc.solo. I made the lib_list_path file using collapse.ctss.bed.gz and unencoded_G.collapse.ctss.bed.gz paths of all the independent libraries and tried running. But it is throwing an error "unencoded_G.collapse.ctss.bed.gz".
I went through the run and found multiple lines with the similar pattern of failure: "Failed to open file chr9:1-30882864".

This seems to be the common thing for all chr.. .
Any suggestion regarding this? I do not seem to understand the problem here.

Thanks in advance.

SC5P-PE and SC5P-R2

Thanks for the great tool. I have 10 SC5P libraries where 4 were sequenced with SC5P-PE and 6 were sequenced with SC5P-R2. If I run the standard workflow on the SC5P-PE libraries and pool the results to identify the location of tCRE can I then use those locations to count and analyze the SC5P-R2 samples (even though they do not contain the TSO-DNA junction)? Similarly, could you use the tCRE locations from SC5P-PE samples to analyze a SC3P sample (even though it may have less sensitivity)?

Getting starsolo paired end output to clear (preprocessing script attached)

Dropping more notes for posterity.

We recently got some data mapped via starsolo rather than cellranger, following the official --soloBarcodeMate 1 --clip5pNbases 39 0 instructions to get paired end data to work. This creates a soft-clip in the BAM, retaining the entire sequence with the cell/UMI barcodes at the start. Scafe's bam_to_ctss is unhappy with this - it detects the correct offset of ~40 to the TSO, but then promptly forgets it, looks at error rate with an offset of 8, sees a ~70% error rate and explodes. I don't have the exact error on hand anymore as this was a few runs ago.

One way to get the BAM to clear would be to run with --detect_TS_oligo trim and a meaningless 39bp sequence as --TS_oligo_seq. However, if wanting to make use of the power of the various modes, getting the BAM close to the cellranger standard with its 26bp hard clip gets the job done. There's still a 13bp softclip where the TSO is supposed to go, but that might actually be a blessing as it may prevent errant alignment to the reference?

Without further ado, an operational script to get the BAM into that shape:

import pysam

#check if a given bit (from the end, one-indexed) is set in a number's binary representation
#https://www.tutorialspoint.com/check-whether-the-bit-at-given-position-is-set-or-unset-in-python
#k=5 for revcomp (16)
#k=7 for first read (64)
def set_bin_flag(n,k):
   temp = n >> (k - 1)
   if temp & 1:
      return True
   return False

bam = "Aligned.sortedByCoord.out.bam"
infile = pysam.AlignmentFile(bam,"rb")
outfile = pysam.AlignmentFile("scafe.bam", "wb", template=infile)
#I don't think this needs until_eof=True as scafe only cares about aligned reads
for read in infile:
    #is our thing R1, and needs 26bp clipped from somewhere?
    if set_bin_flag(read.flag, 7):
        #it is an R1!
        #we'll need the quality stripped off to a separate variable
        #as per pysam docs, modifying query_sequence resets qualities
        q = read.query_qualities
        #just trimming the sequence/quality isn't enough
        #need to shave 26bp out of the correct bit of cigar
        #this happens within the if
        #is it a revcomped R1?
        if set_bin_flag(read.flag, 5):
            #it is revcomped. trim 26bp from end
            read.query_sequence = read.query_sequence[:-26]
            read.query_qualities = q[:-26]
            #cigar modification time
            #start by sanity checking taht we actually have a softclip (S=4 as per docs) there
            if read.cigartuples[-1][0] != 4:
                raise ValueError(read.query_name+" did not have S in the right place in the cigar string?!")
            #alright, it's an S there, like it should be. shave the range by 26bp
            #need to do a proper full blown assign, it doesn't do partial assigns somehow
            read.cigartuples = read.cigartuples[:-1] + [(read.cigartuples[-1][0], read.cigartuples[-1][1]-26)]
        else:
            #it's forward stranded. trim 26bp from start
            read.query_sequence = read.query_sequence[26:]
            read.query_qualities = q[26:]
            #same story with the cigar, but the first tuple needs to be modified
            if read.cigartuples[0][0] != 4:
                raise ValueError(read.query_name+" did not have S in the right place in the cigar string?!")
            read.cigartuples = [(read.cigartuples[0][0], read.cigartuples[0][1]-26)] + read.cigartuples[1:]
    #the read is now modded, if it needs to be
    outfile.write(read)

#we're done
infile.close()
outfile.close()

The idea is to find R1s, then check if they're aligned to the reverse strand (both of these using SAM flags), then trim 26bp either from the start or the end of the read+CIGAR as appropriate. This actually takes a while to run, think it might be taking longer than the subsequent scafe on my end.

scafe not found when using singularity + docker image

Hi there. Thanks for a great resource. We are excited to apply SCAFE to our data.

I am trying to run SCAFE using the Docker image pulled from https://hub.docker.com/r/cchon/scafe using singularity:
singularity pull docker://cchon/scafe:latest

However, when I try to use the docker image, SCAFE commands are not recognized. For example, running a shell:
>singularity shell scafe_latest.sif
>scafe_latest.sif scafe.workflow.sc.solo --help
returns this error:
bash: scafe.workflow.sc.solo: command not found

And executing the command:
>singularity exec scafe_latest.sif scafe.workflow.sc.solo --help
returns this error:
/.singularity.d/actions/exec: 21: exec: scafe.workflow.sc.solo: not found

Some googling suggests that this could happen if the image does not have /bin/sh in it (see apptainer/singularity#2386)... in other words maybe it is a docker to singularity issue? Do you know if your docker container should be compatible in singularity?

Thanks for any help!

Reference handling in an automated, multi-sample container setting (with workarounds)

The SCAFE Docker image does not come with the references pre-downloaded, which is understandable. However, it expects the reference positioned in a very particular location within the container files, necessitating a fresh download each time the container is started up. This becomes doubly problematic when converting to Singularity, which is read-only (just adding --writable as indicated in the readme does not work) and my attempts to make it not be so resulted in conflicts with Nextflow.

The easiest resolution would be to allow specifying a path to the reference. This way it could be pre-downloaded ahead of time and provided directly to the container as part of the mount, simplifying the process considerably.

Current workarounds follow.

In Docker, the easiest way is to start an interactive session with a mount specified via -v, download the reference of interest, and then copy it to the mounted directory. Then the reference can be passed back in as part of the contents of the -v'd folder, and a symlink can be easily set up to the appropriate directory within the script:

#add scafe scripts to path
source ~/.bashrc
#symlink in the referece
ln -s /mnt/hg38.gencode_v32 resources/genome
#do your scafe stuff

For Singularity, the workaround I found involved turning the container to a sandbox, downloading the reference, and turning it back to a .sif file:

#get sif from docker
singularity pull docker://cchon/scafe:latest
#turn to sandbox to allow addition of files
singularity build --sandbox scafe-sandbox scafe_latest.sif
#start up sandboxed version which will allow storing stuff, with --writable
singularity shell --writable --env APPEND_PATH=/SCAFE/scripts --env LC_ALL=C scafe-sandbox
#run within container, download reference
scafe.download.resources.genome --genome=hg38.gencode_v32
#exit container, turn sandbox back into sif
singularity build scafe_hg38.sif scafe-sandbox/

The container can then be called like so, with the reference baked into the appropriate location:

singularity shell --env APPEND_PATH=/SCAFE/scripts --env LC_ALL=C scafe_hg38.sif

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.