Code Monkey home page Code Monkey logo

liulab-dfci / maestro Goto Github PK

View Code? Open in Web Editor NEW
271.0 12.0 75.0 208.21 MB

Single-cell Transcriptome and Regulome Analysis Pipeline

License: GNU General Public License v3.0

Python 12.51% R 5.77% HTML 5.32% Batchfile 0.01% Shell 0.81% Makefile 0.85% C 70.02% JavaScript 2.03% CSS 0.37% Roff 0.52% M4 0.33% Perl 0.84% Scilab 0.07% Ruby 0.46% Dockerfile 0.11%
scatac-seq scrna-seq epigenetics snakemake-workflows transcription-factors gene-regulatory-networks

maestro's Introduction

MAESTRO

GitHub GitHub release (latest by date) Conda Docker Pulls Build Status

MAESTRO(Model-based AnalysEs of Single-cell Transcriptome and RegulOme) is a comprehensive single-cell RNA-seq and ATAC-seq analysis suit built using snakemake. MAESTRO combines several dozen tools and packages to create an integrative pipeline, which enables scRNA-seq and scATAC-seq analysis from raw sequencing data (fastq files) all the way through alignment, quality control, cell filtering, normalization, unsupervised clustering, differential expression and peak calling, celltype annotation and transcription regulation analysis. Currently, MAESTRO support Smart-seq2, 10x-genomics, Drop-seq, SPLiT-seq for scRNA-seq protocols; microfudics-based, 10x-genomics and sci-ATAC-seq for scATAC-seq protocols.

Documentation

We are hosting MAESTRO documentation, instruction and tutorials at MAESTRO Website.

Change Log

v1.0.0

  • Release MAESTRO.

v1.0.1

  • Provide docker image for easy installation. Note, the docker does not include cellranger/cellranger ATAC, as well as the corresponding genome index. Please install cellranger/cellranger ATAC following the installation instructions.

v1.0.2

  • Fix some bugs and set LISA as the default method to predict transcription factors for scRNA-seq. Note, the docker includes the LISA conda environment, but does not include required pre-computed genome datasets. Please download hg38 or mm10 datasets and update the configuration following the installation instructions.

v1.1.0

  • Change the default alignment method of MAESTRO from cellranger to starsolo and minimap2 for accelerating the mapping time.
  • Improve the memory efficiency of scATAC gene score calculation.
  • Incorporate the installation of giggle into MAESTRO, and add web API for LISA function. All the core MAESTRO function can be installed through the conda environment now!
  • Provide more documents for the QC parameters and add flexibility for other parameters in the workflow.

v1.2.0

  • Modify the regulatory potential model by removing the interfering peaks from adjacent genes and adjusting the weight of exon peaks. The "enhanced RP model" is set as the default gene activity scoring model with original "simple RP model" as a option.
  • Modify the integration function of MAESTRO. The new function can output more intermediate figures and log files for diagnosing the possible failure in integrating rare populations.
  • Add the function for annotating cell-types for scATAC-seq clusters based on public bulk chromatin accessibility data from Cistrome database (Note: Please update the giggle index to the latest.
  • Provide the function of generating genome browser tracks at cluster level for scATAC-seq dataset visualization.
  • Support peak calling at the cluster level now!

v1.2.1

  • For scATAC, MAESTRO can support fastq, bam, fragments.tsv.gz as the input of the scATAC-seq workflow.
  • For scATAC, MAESTRO provides an option for users to skip the cell-type annotation step in the pipeline, and an option to choose the strategy for cell-type annotation (RP-based and peak-based).
  • Provide small test data for test pipeline (sampling from 10x fastq files).
  • Add parameter validation before initializing the pipeline and provide more gracious error messages.
  • Update R in MAESTRO conda package from 3.6.3 to 4.0.2, and Seurat from 3.1.2 to 3.1.5.

v1.2.1.9999

  • Bug fixes (placeholder for v1.2.2 formal release)

v1.2.2

  • Update LISA to LISA2 which extends the original, runs faster, reduces dependencies, and adds useful CLI functions for pipeline integration. Please download the LISA2 data from human and mouse.
  • Update conda dependencies to only requesting lowest versions.
  • Fix the bugs in conda package installation channel.
  • Update markers in the mouse.brain.ALLEN cell signature file.
  • Fix the bugs to support the 10X .h5 file as the input format of MAESTRO scatac-genescore.
  • Rename 'Adjusted RP model' to 'Enhanced RP model'.
  • Fix the bugs in Snakefile to meet the latest version of snakemake.
  • Update STAR reference indexes files for STAR -version 2.7.6a. Provide pre-built indexes for human and mouse.

v1.3.0

  • scATAC-seq multi-sample pipeline enabled. Deduplication can be set at cell or bulk level.
  • Peak count matrix can be generated either on binary or raw count.
  • LISA2 data will only be configured once in a given environment.
  • Update web links for downloading reference data.

v1.3.1

  • Fix the bug in raw peak count matrix generation.

v1.3.2

  • Move from TravisCI to GitHub Actions for building package.
  • LISA2 upgrades to v2.2.2. New LISA data are required for human and mouse. Please download to your computer and provide the path when initiating.
  • Add LISA path as a variable in TF annotating function.
  • Reduce the time and memory usage in the peak counting step.
  • Fix the bug in the simple RP model for gene score calculation.
  • Fix the bug in scATAC-Seq Snakefile.

v1.4.0

  • Upgrade dependencies to Seurat 4.0.0 and Signac 1.1.1.
  • Set QC; genes mapped to mitochondrial as a variable in the scRNA-seq analysis.
  • Add output path as variables in the MAESTRO R package.

v1.5.0

  • Support processing multi-samples for both scRNA-seq and scATAC-seq.
  • Integrate chromap as the default aligner for scATAC-seq.
  • Support multi-sample scATAC-seq when starting from bam file with CB tag or 10X like fragment file.
  • Fix the ratio of genes mapped to mitochondrial to percentage.
  • Move MAESTRO documentation to workfowr.

v1.5.1

  • Expand STARsolo --soloFeatures and --runThreadN as MAESTRO subcommands.
  • Support single-nuclei RNA-seq pipeline.
  • Fix bug in sample initiation subcommand to read fastq with sample id greater than 9.
  • Update MAESTRO documentation to v1.5.1. Add snRNA-seq tutorials. Expand scRNA-seq tutorial with lisa2 TF prediction custom analysis. Add multi-scATAC-seq genome track plot for pseudobulk peaks. Explain multi-samples peak calling parameters.

System requirements

  • Linux/Unix
  • Python (>= 3.7) for MAESTRO snakemake workflow
  • R (>= 4.0.2) for MAESTRO R package

Installation

Install MAESTRO

There are two ways to install MAESTRO -- to install the full workflow through Anaconda cloud; or to install only the R codes for exploring the processed data.

Installing the full solution of MAESTRO workflow through conda

MAESTRO uses the Miniconda3 package management system to harmonize all of the software packages. Users can install the full solution of MAESTRO using the conda environment.

Use the following commands to install Minicoda3:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

And then users can create an isolated environment for MAESTRO and install through the following commands:

$ conda config --add channels defaults
$ conda config --add channels liulab-dfci
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
# To make the installation faster, we recommend using mamba
$ conda install mamba -c conda-forge
$ mamba create -n MAESTRO maestro=1.5.1 -c liulab-dfci
# Activate the environment
$ conda activate MAESTRO

Installing the MAESTRO R package from source code

If users already have the processed datasets, like cell by gene or cell by peak matrix generate by Cell Ranger. Users can install the MAESTRO R package alone to perform the analysis from processed datasets.

$ R
> library(devtools)
> install_github("liulab-dfci/MAESTRO")

Citation

Wang C, Sun D, Huang X, Wan C, Li Z, Han Y, Qin Q, Fan J, Qiu X, Xie Y, Meyer CA, Brown M, Tang M, Long H, Liu T, Liu XS. Integrative analyses of single-cell transcriptome and regulome using MAESTRO. Genome Biol. 2020 Aug 7;21(1):198. doi: 10.1186/s13059-020-02116-x. PMID: 32767996; PMCID: PMC7412809.

maestro's People

Contributors

baigal628 avatar chenfeiwang avatar crazyhottommy avatar dongqingsun avatar dongqingsun96 avatar kant avatar liulab-dfci avatar lzy604 avatar mengxiao avatar qiangtu avatar taoliu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

maestro's Issues

Bug: Bugs of relative pathes in the MAESTRO scrna-/scatac-init scripts and Errors in Tutorial

The version of MAESTRO: v1.1.0

Command line bugs:

  1. If --fastq-dir ends with /, the snakemake cmd will give a ‘double /’ warning.
  2. If --fastq-dir is a relative path, it won’t work since this relative path will be added to the config file in a new child directory; the same problem can be found for the option for whitelist file.

Tutorial errors:

  1. The instruction, there is no --directory option but -d.
  2. The assembly names GRCh38 or GRCm38 are not species names!
  3. Please do not hardcode path names in instruction.
  4. Errors in the scRNA instruction: the download link for the STAR index only contains human data. The RSEM library download link is pointing to giggle download.

Why I can't open the HTML report

Several sets of data were tested, and it was very convenient.
When I open HTML report, but I can't see any contents of the report only sample information.
Is there anything I haven't noticed?
image

Bug: bdg to bw in Snakemake

The current implementation has a potential problem. The bedtools intersect with -f 1.00 will get rid of the region that is not ENTIRELY included in the chr_limit file. It would be more reasonable to clip instead of discarding the whole region.

rule scatac_bdg2bw:
input:
bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
chrlimit = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr_limit.bed",
chrlen = SCRIPT_PATH + "/annotations/" + config["species"] + "_chr.len",
output:
tmp = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg.tmp",
bigwig = "Result/Analysis/" + config["outprefix"] + "_all.bw"
params:
outdir = "Result/Analysis/" + config["outprefix"]
shell:
"bedtools intersect -a {input.bdg} -b {input.chrlimit} -wa -f 1.00 > {output.tmp};"
"bedSort {output.tmp} {input.bdg};"
"bedGraphToBigWig {input.bdg} {input.chrlen} {output.bigwig};"

Check my gist: https://gist.github.com/taoliu/2469050

lisa unpack error

Hi, I am running MAESTRO 1.2.2 following the tutorial. The scRNA-seq module complained:

lisa unpack: error: argument tarball: ERROR: /home/qtu/work/maestro/lisa/hg38_2.1.tar.gz is not a valid file.

Actually I could directly execute lisa unpack hg38_2.1.tar.gz successful, not sure what caused the error.

BTW, does the option --lisadir lisa/hg38_2.1.tar.gz mean the pipeline will unpack the huge file each time?

Thanks!

Enhancement: Suggestions on MAESTRO/R/ATACRunSeurat.R

If we look at MAESTRO/R/ATACRunSeurat.R script, it includes many of Seurat functions, such as RunLSI, FindCluster, and RunUMAP. However, the parameters for these functions are mainly fixed with only few tunable parameters for users. How can users tweak the parameter such as n.neighbors or min.dist for UMAP with this script, or the maximum number of components from LSI (n=50 is fixed)? My suggestion is to break this big script into at least three modules -- LSI, UMAP, and DE, and provide more options!

Bug: scATAC_genescore.py and ATACCalculateGenescore.R can't handle peaks from unusual chromosomes

When peaks are from unusual chromosomes such as "chr11_KI270721v1_random", scATAC_genescore.py and ATACCalculateGenescore.R can't correctly split the peak name such as "chr11_KI270721v1_random_12345_67890" to find the chromosome name, start and end positions since it will use "_" as the separator:

for ipeak, peak in enumerate(peaks_list):
peaks_tmp = peak.decode().split("_")
peaks_info.append([peaks_tmp[0], (int(peaks_tmp[1]) + int(peaks_tmp[2])) / 2.0, 0, ipeak])

Please use rsplit("_",maxsplit=2) instead. Here is an example:

>>> a="chr11_KI270721v1_random_12345_67890"
>>> a.split("_") # this will incorrectly split the chromosome name
['chr11', 'KI270721v1', 'random', '12345', '67890']
>>> a.rsplit("_",maxsplit=2) # this will give exactly three values
['chr11_KI270721v1_random', '12345', '67890']

Question: Can we do TF annotation without doing cell annotation?

In some cases, we do not know the actual cell types, or we do not know the signature genes of possible cell types, so we'd like to skip cell-type annotation step, and plan to identify the driver TFs for each cluster. Can we do that? I am asking this since according to the tutorial, the regulation potentials (RP, or gene scores) are only to be included in the 'cell type annotation' step, but RPs are needed in the TF identification step.

install_github("liulab-dfci/MAESTRO") error

Dear Tao,

When I ran install_github("liulab-dfci/MAESTRO"), I encountered the below error. Have you ever encountered the same error?

Error: Failed to install 'unknown package' from GitHub:
Failed to connect to api.github.com port 443: Connection refused

Thanks very much for your help.

Best regards,
Leon.

error at scatac_qcfilter step

Hi,
I got the error below when running the pipeline starting from bam. Any suggestion what I should try to fix this? Also I wonder if the pipeline can be started at any point. In my case here, I would like to resume the pipeline from this scatac_qcfilter step once I figure out the problem, rather than starting over the whole pipeline again.

rule scatac_qcfilter:
input: Result/Analysis/SI_25417_peak_count.h5
output: Result/QC/SI_25417_filtered_peak_count.h5
jobid: 4
benchmark: Result/Benchmark/SI_25417_QCFilter.benchmark

Traceback (most recent call last):
File "/root/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in
import('pkg_resources').run_script('MAESTRO==1.2.1', 'MAESTRO')
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/init.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 112, in
main()
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.2.1-py3.7.egg-info/scripts/MAESTRO", line 94, in main
scatac_qc(directory = args.directory, outprefix = args.outprefix, fileformat = args.format, peakcount = args.peakcount, feature = args.feature, barcode = args.barcode, single_stat = args.single_stat, peak_cutoff = args.peak_cutoff, count_cutoff = args.count_cutoff, frip_cutoff = args.frip_cutoff, cell_cutoff = args.cell_cutoff, species = args.species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 147, in scatac_qc
Filter(rawmatrix = peakmatrix, feature = features, barcode = barcodes, peak_cutoff = peak_cutoff, cell_cutoff = cell_cutoff, validcell = validcells_list, outprefix = filename, species = species)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scATAC_QC.py", line 84, in Filter
passed_cell_matrix = rawmatrix[np.ix_(passed_peak.tolist()[0], passed_cell.tolist()[0])]
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/_index.py", line 75, in getitem
return self._get_arrayXarray(row, col)
File "/root/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 665, in _get_arrayXarray
major.size, major.ravel(), minor.ravel(), val)
ValueError: could not convert integer scalar
[Wed Sep 2 20:52:51 2020]
Error in rule scatac_qcfilter:
jobid: 4
output: Result/QC/SI_25417_filtered_peak_count.h5
shell:
MAESTRO scatac-qc --format h5 --peakcount Result/Analysis/SI_25417_peak_count.h5 --peak-cutoff 100 --cell-cutoff 10 --directory Result/QC --outprefix SI_25417
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Thanks,
Yuping

possible typo in the ATAC instruction page

Hi,

I ran MAESTRO by the Docker image successfully, then I tried to explore the data using R following the instruction. In the step 4, there is:

p <- VisualizeTFenrichment(TFs = tfs,
                             cluster.1 = 0,
                             type = "ATAC",
                             SeuratObj = pbmc.ATAC.res$ATAC,
                             GIGGLE.table = "10X_PBMC_10k_giggle.txt",
                             visual.totalnumber = 100,
                             name = "10X_PBMC_10k_Monocyte_filtered")

I think the GIGGLE.table should be '10X_PBMC_10k.PredictedTFScore.txt', because in the directory, there is no such file named '10X_PBMC_10k_giggle.txt'.

Thanks!

gene_peak matrix in RP_model

The value range of gene_peak matrix in RP_model should between 0 and 1. Why does "genes_peaks_score_csr" in scATAC_Genescore.py have a value greater than 1, with a maximum of 1600+.

scrna-init creates a problematic snakefile for 10x-genomics

I used the following command to create a Snakefile following the instructions at https://github.com/liulab-dfci/MAESTRO/blob/master/example/RNA_infrastructure_10x/RNA_infrastructure_10x.md.

Unfortunately, the Snakefile does not work since I get the following error:

IndexError in line 34 of /storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile:
list index out of range
  File "/storage/mathelierarea/processed/anthoma/Projects/single_cell/results/20201216_maestro/10X_PBMC_RNA_8k_MAESTRO_V122/Snakefile", line 34, in <module>
  File "/lsc/MAESTRO/1.2.2/envs/MAESTRO/lib/python3.8/site-packages/MAESTRO/scRNA_utility.py", line 44, in getfastq_10x

The problem comes from the fact that the scrna-init command did not provide values for the transcript, barcode, and decompress variable; and these variables are used in the Snakefile for the STAR command.

This should not be the case given your documentation that specifies that --fastq-barcode and --fastq-transcript should only be provided for the Dropseq format (not 10x-genomics)

You can find the Snakefile and config.xml files in the archive enclosed.
snakefile_and_config.zip

cluster resolution other than 0.6 with error in coembedding

Hello,

I tried to cluster my scRNA and scATAC datasets using a resolution other than 0.6 (this number was used in the MAESTRO vignettes), but I got an error saying that "ATAC_snn_res.0.6" or "RNA_snn_res.0.6" cannot be found in the Seurat object when I used the "Incorporate" function to coembed the two datasets. Do I need to use a default resolution of 0.6 before coembedding? But can I change the resolution after coembedding?

Thank you!

Is MAESTRO compatible with 10X data derived from nuclear RNA?

Hello,

I'm currently installing the MAESTRO prerequisites and, after reading the paper, I'd like to ask if MAESTRO is compatible with 10X data derived from nuclear RNA, particularly if I'm looking to integrate single-modal snRNA- and snATAC-seq data?

And more specifically, could the use of a pre-mRNA reference and GTF files for alignment, as opposed to standard reference/annotation files, impact a MAESTRO analysis at all?

Until now I have been using Cell Ranger 4 for my analysis which recommends using a pre-mRNA reference and GTF file for nuclear RNA. I had started creating STARsolo compatible versions of these files for my MAESTRO analysis and wondered if this is the best course of action, particularly as 10X have recently released v5 which includes a new function for dealing with intronic reads without the need of a pre-RNA reference, and STARsolo also provides a similar function.

Regardless, it would be useful to hear if you have any recommendations or points of interest that I should consider when running MAESTRO using single-nuclear data.

Many Thanks,

Darren

Multi-sample analysis!

Hi!
A very good tool.
I now have five scATAC samples.How do I merge multiple samples?

FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'

Hi Chenfei,

Thank you for making MAESTRO available! I encountered an error while processing some 10x scRNA-seq data from mice.

It seems that MAESTRO couldn't find PredictedTFTop10.txt. I have RABIT installed and specified the RABIT index in the yaml file.

[Mon Dec  9 15:03:42 2019]
rule scrna_report:
    input: Result/Analysis, Result/QC/M1R1_MAESTRO_scRNA_read_distr.png, Result/QC/M1R1_MAESTRO_scRNA_read_quality.png, Result/QC/M1R1_MAESTRO_scRNA_NVC.png, Result/QC/M1R1_MAESTRO_scRNA_GCcontent.png, Result/QC/M1R1_MAESTRO_scRNA_genebody_cov.png, Result/QC/M1R1_MAESTRO_scRNA_cell_filtering.png
    output: Result/M1R1_MAESTRO_scRNA_report.html
    jobid: 1

Traceback (most recent call last):
  File "/omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py", line 36, in <module>
    for line in open(cluster_regulator_file,"r").readlines():
FileNotFoundError: [Errno 2] No such file or directory: 'Result/Analysis/M1R1_MAESTRO.PredictedTFTop10.txt'
[Mon Dec  9 15:03:42 2019]
Error in rule scrna_report:
    jobid: 1
    output: Result/M1R1_MAESTRO_scRNA_report.html
    shell:
        python /omg/home/username/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO/scRNA_report.py M1R1_MAESTRO /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/M1R1 GRCm38 10x-genomics True

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/username/Projects/maestro_eval/results/2019-12-06-first/M1R1_MAESTRO/.snakemake/log/2019-12-09T150340.656779.snakemake.log

This appears to the last missing step in MAESTRO as rerunning the pipeline gave me:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       scrna_report
        2

A hardcoded path of LISA in the docker image

Hello,

I am using MAESTRO docker image winterdongqing/maestro, 1.2.1, 57f41ffbe200. When I ran the scRNA-seq module, it reported errors like:

unable to open file: name = '/project/dev/qqin/LISA/lisa_web/cistromedb_data/hg38/cluster_human/DNase_median_for_each_cluster.h5'

This error should be caused by a hardcoded path of LISA. I changed to web LISA, then this module finished successfully.

Bug: Missing default value or required status for `MAESTRO init` command

The main executable script MAESTRO will throw exceptions when no argument is set for init command:

$ MAESTRO init 
Traceback (most recent call last):
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/bin/MAESTRO", line 4, in <module>
    __import__('pkg_resources').run_script('MAESTRO==1.0.1', 'MAESTRO')
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 79, in <module>
    main()
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 72, in main
    init_workflow(args.directory, args.module)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/site-packages/MAESTRO-1.0.1-py3.7.egg-info/scripts/MAESTRO", line 21, in init_workflow
    os.makedirs(directory)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/os.py", line 206, in makedirs
    head, tail = path.split(name)
  File "/mnt/lustre/users/tliu/miniconda3/envs/MAESTRO/lib/python3.7/posixpath.py", line 107, in split
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The solution is either to add required=True for "-m" and "-d", or to set default= for these options.

MAESTRO/MAESTRO/MAESTRO

Lines 61 to 62 in 1eba0a2

workflow.add_argument("-d", "--directory", help = "Path to the directory where the workflow shall be initialized and results shall be stored.")
workflow.add_argument("-m", "--module", choices = ["scATAC","scRNA","integrate"], help="Analysis of scATAC-seq or scRNA-seq, or integrattion analysis")

Bug: Hardcoded separator "_" doesn't work for 10X H5 format, in scATAC_genescore.py

The following line for calculating RP scores assumes that peaks are named as "chr_start_end" in peak matrix.

peaks_tmp = peak.decode().split("_")

However, the standard output "filtered_peak_bc_matrix.h5" from 10X cellranger-atac may use chr:start-end", according to

https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/advanced/h5_matrices

Suggest replacing the separator in the peak column to a standard character (either "_" or "-") in MAESTRO R script after loading the H5 file.

Python scripts should use argparse

Scripts like scRNA_qc.py use many positional arguments, which is not particularly user friendly if the user needs to run the script outside of the snakemake pipeline. It would be easier for the user if argparse was used so that they can specify command line flags and have a help doc for the script.

Enhancement: Documentation of APIs

We need documentation of APIs. Since users need to know the options of each function, the data structures, in order to fine-tune their pipeline or skip certain steps. Currently, the only available documentation is the tutorial, which is not enough.

giggle and lisa tables for drosophila

Is it possible to generate the giggle and lisa tables for drosophila melanogaster? I would like to use these to generate the driver TFs for both scRNA and scATAC-seq samples.

Multiple Samples

How would one go about using MAESTRO (starting at Step 4 of the pipeline) with multiple sc-RNA-Seq samples?

Pass in existing Seurat/SingleCellExperiment Object

Is it possible to pass in an existing Seurat object (or converted SingleCellExperiment object)? I have a pretty robust pipeline that does QC, clustering and such, would like to have some level of preservation of those clusters and such.

Suggestion to set default assay, fix hard coded lines and more in Incorporate.R

To make this script more flexible for customized analysis, please add DefaultAssay(ATAC) <- "ACTIVITY" to the end of this code block.

MAESTRO/R/Incorporate.R

Lines 47 to 64 in d58fc18

if(is.null(ATAC[["ACTIVITY"]])&method!="Seurat"){
RPmatrix <- RPmatrix[,intersect(colnames(ATAC), colnames(RPmatrix))]
ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(RPmatrix)))
ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = RPmatrix)
DefaultAssay(ATAC) <- "ACTIVITY"
ATAC <- FindVariableFeatures(ATAC)
ATAC <- NormalizeData(ATAC)
ATAC <- ScaleData(ATAC)}
if(method=="Seurat"){
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)
activity.matrix <- activity.matrix[,intersect(colnames(ATAC), colnames(activity.matrix))]
ATAC <- SubsetData(ATAC, cells = intersect(colnames(ATAC), colnames(activity.matrix)))
ATAC[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
DefaultAssay(ATAC) <- "ACTIVITY"
ATAC <- FindVariableFeatures(ATAC)
ATAC <- NormalizeData(ATAC)
ATAC <- ScaleData(ATAC)}

Also, please fix the hard code at:

activity.matrix <- CreateGeneActivityMatrix(peak.matrix = RPmatrix, annotation.file = "/homes/cwang/annotations/hg38/Homo_sapiens.GRCh38.98.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)

p2 <- DimPlot(CombinedObj, reduction = "umap", group.by = "RNA_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='RNA'),]), label = TRUE, repel = TRUE)

p3 <- DimPlot(CombinedObj, reduction = "umap", group.by = "ATAC_snn_res.0.6", cells = rownames(CombinedObj@meta.data[which(CombinedObj@meta.data[,'tech']=='ATAC'),]), label = TRUE, repel = TRUE)

Another suggestion is to separate the plotting functions from this script since when users get the CombinedObj, they can plot whatever they want. To combine plotting with computation is unnecessary and will potentially cause some bugs.

Bug: incorrect use of MACS2 -n option

As shown in the current code to invoke macs2:

rule scatac_allpeakcall:
input:
bam = "Result/BWA/" + config["outprefix"] + ".merged.sortedByPos.rmdp.bam"
output:
peak = "Result/Analysis/" + config["outprefix"] + "_all_peaks.narrowPeak",
bdg = "Result/Analysis/" + config["outprefix"] + "_all_treat_pileup.bdg",
params:
name = "Result/Analysis/" + config["outprefix"] + "_all"
log:
"Result/Log/" + config["outprefix"] + "_macs2_allpeak.log"
conda:
ENV_PATH + "/environment_py2.yaml"
shell:
"macs2 callpeak -g hs -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}"

The option '-n' contains a folder structure. This is not a recommended way to specify an output directory of macs2. Please use '--outdir' instead and keep the '-n' simple such as:

name = config["outprefix"] + "_all"

and

macs2 callpeak -g hs --outdir Result/Analysis -n {params.name} -B -q 0.05 --nomodel --extsize=50 --SPMR -t {input.bam}

The issue is also related to other places in this Snakemake file where MACS2 is invoked.

Dockerfile recipe error

I'm currently hitting this error on the last step of the Dockerfile recipe:

Step 10/10 : RUN cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz
 ---> Running in a935b7a7a681
tar (child): rabit.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
The command '/bin/sh -c cd /root/Annotation && tar -xzvf rabit.tar.gz && tar -xzvf giggle.tar.gz && rm rabit.tar.gz && rm giggle.tar.gz' returned a non-zero code: 2

Bug: Let reticulate find the current Python

If users use MAESTRO for the first time or more accurately use "reticulate" package through ATACCalculateGenescore script for the first time, they will be prompted to decide whether "reticulate" should install an independent miniconda environment by itself.

No non-system installation of Python could be found.
Would you like to download and install Miniconda?
Miniconda is an open source environment management system for Python.
See https://docs.conda.io/en/latest/miniconda.html for more details.

Would you like to install Miniconda? [Y/n]: n

Assuming users are utilizing conda/miniconda version of MAESTRO for their analysis, they should stick to the python in that environment, so answer "n". This has to be done just once. We'd better describe it in our tutorial otherwise users may have a lot of trouble later on if they accidentally answer "Y" to the above question. See the discussion regarding this in rstudio/reticulate#607.

The code related to the issue is

library(reticulate)
source_python(paste(system.file(package="MAESTRO"), "ATACCalculateGenescore.py", sep="/"))

Install MAESTRO version 1.2.1.9999

I followed the instruction https://github.com/liulab-dfci/MAESTRO to install MAESTRO on HPC cluster for our institution research community.

My python version is 3.7.3-anaconda; conda version 4.9.2-py37h89c1867_0

I followed the instruction “ Installing the full solution of MAESTRO workflow through conda “, here is my procedures

[ris_hpc_apps@tdragon4 ~]$ module load python/3.7.3-anaconda

[ris_hpc_apps@tdragon4 ~]$ which conda

/risapps/rhel7/python/3.7.3/bin/conda

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels defaults

Warning: 'defaults' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels bioconda

Warning: 'bioconda' already in 'channels' list, moving to the top

[ris_hpc_apps@tdragon4 ~]$ conda config --add channels conda-forge

Warning: 'conda-forge' already in 'channels' list, moving to the top

   -- no need to add those channels; they are already there.

 

[ris_hpc_apps@tdragon4 ~]$ conda install mamba -c conda-forge

Collecting package metadata (repodata.json): done

Solving environment: done

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3

  added / updated specs:

    - mamba

The following packages will be downloaded:

 

    package                    |            build

    ---------------------------|-----------------

    ca-certificates-2020.11.8  |       ha878542_0         145 KB  conda-forge

    certifi-2020.11.8          |   py37h89c1867_0         150 KB  conda-forge

    conda-4.9.2                |   py37h89c1867_0         3.0 MB  conda-forge

    libsolv-0.7.16             |       h8b12597_0         455 KB  conda-forge

    ------------------------------------------------------------

                                           Total:         3.8 MB

 

The following NEW packages will be INSTALLED:

  libsolv            conda-forge/linux-64::libsolv-0.7.16-h8b12597_0

  mamba              conda-forge/linux-64::mamba-0.1.2-py37h99015e2_0

 

The following packages will be UPDATED:

  ca-certificates                      2020.6.20-hecda079_0 --> 2020.11.8-ha878542_0

  certifi                          2020.6.20-py37he5f6b98_2 --> 2020.11.8-py37h89c1867_0

  conda                                4.9.0-py37he5f6b98_1 --> 4.9.2-py37h89c1867_0

 

Proceed ([y]/n)? *y*

# I choose yes to update the conda to the latest version here

 

 Downloading and Extracting Packages

libsolv-0.7.16       | 455 KB    | ########################################################### | 100%

 certifi-2020.11.8    | 150 KB    | ########################################################### | 100%

 ca-certificates-2020 | 145 KB    | ########################################################### | 100%

 conda-4.9.2          | 3.0 MB    | ########################################################### | 100%

 Preparing transaction: done

Verifying transaction: done

Executing transaction: done

 

[ris_hpc_apps@tdragon4 ~]$ mamba create -n MAESTRO-1.2.1 -c liulab-dfci

  (since we already have MAESTO created for version 1.0.1, I choose a different conda environment name MAESTRO-1.2.1  )

  .......

   Getting  liulab-dfci linux-64

Getting  liulab-dfci noarch

Getting  conda-forge linux-64

Getting  conda-forge noarch

Getting  bioconda linux-64

Getting  bioconda noarch

Getting  pkgs/main linux-64

Getting  pkgs/main noarch

Getting  pkgs/r noarch

Getting  r linux-64

Getting  pkgs/r linux-64

Getting  dranew linux-64

Getting  r noarch

Getting  dranew noarch

 

Looking for: []

 

9 packages in https://conda.anaconda.org/liulab-dfci/linux-64

0 packages in https://conda.anaconda.org/liulab-dfci/noarch

139892 packages in https://conda.anaconda.org/conda-forge/linux-64

43184 packages in https://conda.anaconda.org/conda-forge/noarch

33562 packages in https://conda.anaconda.org/bioconda/linux-64

21237 packages in https://conda.anaconda.org/bioconda/noarch

18648 packages in https://repo.anaconda.com/pkgs/main/linux-64

2356 packages in https://repo.anaconda.com/pkgs/main/noarch

6637 packages in https://repo.anaconda.com/pkgs/r/linux-64

5025 packages in https://repo.anaconda.com/pkgs/r/noarch

6620 packages in https://conda.anaconda.org/r/linux-64

5025 packages in https://conda.anaconda.org/r/noarch

170 packages in https://conda.anaconda.org/dranew/linux-64

0 packages in https://conda.anaconda.org/dranew/noarch

 

## Package Plan ##

 

  environment location: /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

 

Proceed ([y]/n)? y

 #
  # To activate this environment, use
  #
  #     $ conda activate MAESTRO-1.2.1
  #
  # To deactivate an active environment, use
  #
  #     $ conda deactivate

Preparing transaction: done

Verifying transaction: done

Executing transaction: done

At this point, I check the environment directory /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1;

I only notice one directory was generated, is this normal?

[ris_hpc_apps@tdragon4 MAESTRO]$ ls -l /risapps/rhel7/python/3.7.3/envs/MAESTRO-1.2.1

total 0

drwxr-xr-x 2 ris_hpc_apps rists 29 Nov 20 16:07 conda-meta

Note: I used our software installation account ris_hpc_apps to install this on HPC cluster. I suppose to let user to do "conda activate MAESTRO-1.2.1" ( "conda init bash" is needed before this)

Regards,
Rong

--giggleannotation for other species

Hi Chenfei,

Does MASTRO allow to process data of other species, like zebrafish? If so, how should I prepare the reference genome files, like giggle annotation files?

Thank you!
Yujie

Error in integrating Seurat scRNA and Signac scATAC objects

First, I readRDS for my Seurat/harmony and Signac clustered scRNA and scATAC objects, respectively.

Then, I tried to run the Incorporate function:
coembed <- Incorporate(RNA = rna, ATAC = atac, project = "Maestro_integrated", method = "MAESTRO")

But I get the following error:
Error: Cannot find 'ACTIVITY' in this Seurat object

I also tried:
coembed <- Incorporate(RNA = rna$RNA, ATAC = atac$ATAC, project = "Maestro_integrated", method = "MAESTRO")

But,
Error: Cannot find 'ATAC' in this Seurat object

Enhancement: Optimize the memory usage for calculating gene score (RP) for scATAC-seq with a large number of cells

I have to process a scATAC-seq dataset with over 48,000 cells after merging three conditions and 2 replicates each condition. I kept 200,000 top peaks. The scATAC_cellranger_count.py can finish successfully, however, even our high memory node (260G mem) of our HPC keeps killing scATAC_genescore.py. We need to optimize the memory usage of this script. This issue ticket will track the progress of the optimization.

Incompatibility with Seurat 4.0

When installing MAESTRO (774ce46) on top of satijalab/seurat/release/4.0.0:

** byte-compile and prepare package for lazy loading
Error: object ‘CreateGeneActivityMatrix’ is not exported by 'namespace:Seurat'
Execution halted
ERROR: lazy loading failed for package ‘MAESTRO’

Documentation clarification

It would be appreciated if the documentation explicitly mentioned that rownames of the input matrixes need to be gene symbols and not other annotations. Took me a while to figure out that LISA requires symbols, and that the mismatch in rownames of my ATAC and RNA data kept causing Incorporate() to segfault. However, I also admit I am using a slightly processed counts matrix for my RNA input, which used ensemble ids as the rownames.

How to generate those QC metrics plots?

Hi,

Thank you for this great work! I just have one question when I go through the Galleries & Tutorials. For the scRNA-seq example, after the RNARunSeurat() function, you use str() to check those features. How to plot those figures below the line of str(pbmc.RNA.res$RNA)?

Thanks,
Frank

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.