The mprasnakeflow from kircherlab

document running on cluster

Especially document how to create the log folder.
Also document that it is possible to run the workflow in a different folder.

Assignment: unmatched reads were mapped

this line is wrong

MPRAsnakeflow/workflow/rules/assignment.smk

Line 163 in 0c1f1fc

samtools view -F 514 {input.bams} | \

The flag should be 513 instead of 514. Otherwise, reads that were not joined are also mapped.
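To see why the two masks behave differently, here is a minimal sketch that decodes the flag values using the bit definitions from the SAM specification. The example read flags (77 for an unjoined first-in-pair read, 4 for a joined single read) are assumptions about how the merging step marks reads in an unaligned BAM; they are not taken from the workflow.

```python
# Bit meanings of the SAM FLAG field (from the SAM specification).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the names of all bits set in a FLAG value."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if value & bit]


def is_excluded(read_flag, exclude_mask):
    """samtools view -F drops a read if ANY bit of the mask is set."""
    return (read_flag & exclude_mask) != 0


print(decode_flag(514))  # bits removed by the current line
print(decode_flag(513))  # bits removed by the proposed fix

# Assumed flags: 77 = paired + unmapped + mate_unmapped + first_in_pair
# (an unjoined read), 4 = unmapped single read (a joined read).
for mask in (514, 513):
    print(mask, "drops unjoined:", is_excluded(77, mask),
          "drops joined:", is_excluded(4, mask))
```

Under these assumptions, the mask 514 lets unjoined reads through because it tests the proper_pair bit rather than the paired bit, while 513 removes them and still keeps the joined reads.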

Run MPRAsnakeflow on nullomers

Run MPRAsnakeflow on nullomers to see the stats for the MPRA (especially the DNA).

input data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/resources/Nullomers

output data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/results/nullomers

Config is:

nullomers:
  bc_length: 15
  umi_length: 16
  data_folder: resources/Nullomers/data
  experiment_file: resources/Nullomers/experiment.csv
  demultiplex: False
  assignments:
    october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
  design_file: resources/Nullomers/design_file.fa
  configs:
    noZeros:
      bc_threshhold: 10
      minDNACounts: 1
      minRNACounts: 1

There is no label file (maybe we can generate one with DISTAL, MID and PROXIMAL).

rewrite plot_perInsertCounts_correlation.R

The script plot_perInsertCounts_correlation.R is not performant at all! On large data it takes more than 100 GB of memory and runs for more than two days.

Reference in count workflow

It might not be needed at all.

Otherwise, check that it is in line with the reference from the assignment.

MPRAsnakeflow - Getting started

Compare ratios - DNA and RNA II (with all DNA_counts = 1)

rule assignment_getInputs

Only copy (or symlink) when there is exactly one input!
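One possible reading of this request, sketched in Python: link the file when there is a single input, and concatenate only when there are several. The function name `gather_inputs` is hypothetical, and the assumption that multiple inputs should be concatenated byte-wise (which is valid for gzip files) is mine, not taken from the rule itself.

```python
import os
import shutil


def gather_inputs(inputs, output):
    """Symlink a single input; concatenate several inputs into one file."""
    if len(inputs) == 1:
        # No data is copied, the output just points at the input.
        os.symlink(os.path.abspath(inputs[0]), output)
        return
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Byte-wise concatenation; concatenated gzip members
                # still form a valid gzip stream.
                shutil.copyfileobj(src, out)
```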

Base composition on barcodes

Before assignment we get different complexities for DNA or RNA.

We get an average RNA complexity of 6181380 and an average DNA complexity of 2413583, so DNA has roughly 1/3 (about 0.39) of the RNA complexity.
(After assignment it is about 1/2: 1028753 vs 2070490.)

The question is why the complexities differ so much. Two explanations came to mind:

Sequencing errors in RNA
Discarding of (the same pool of) DNA barcodes

For the second, a primer might extend into the barcode, so that we lose one base and thereby reduce the complexity. To find out whether that happens, we should look at the base composition at every position of the BCs and check whether, e.g., the end of the barcode contains only 3 of the 4 nucleotides.
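The per-position check could be sketched as follows. This is a minimal illustration on an in-memory list of barcodes; the function names and the `min_fraction` cutoff of 0.05 are hypothetical and would need tuning on real data.

```python
from collections import Counter


def base_composition(barcodes):
    """Count nucleotides at every barcode position."""
    length = max((len(bc) for bc in barcodes), default=0)
    counts = [Counter() for _ in range(length)]
    for bc in barcodes:
        for pos, base in enumerate(bc):
            counts[pos][base] += 1
    return counts


def depleted_positions(counts, min_fraction=0.05):
    """Return 0-based positions where any of A, C, G, T is (nearly) absent."""
    flagged = []
    for pos, counter in enumerate(counts):
        total = sum(counter.values())
        if total and any(counter[b] / total < min_fraction for b in "ACGT"):
            flagged.append(pos)
    return flagged


# Toy example: the last position never contains a T.
toy = ["ACGA", "CGTC", "GTAG", "TACA"]
print(depleted_positions(base_composition(toy)))
```

Running this separately on the DNA and RNA barcodes would show whether only the DNA set has a depleted position.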

BC on multiple inserts

check how many
filter them out
rerun
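A minimal sketch of the first two steps, assuming the assignment can be read as (barcode, insert) pairs; the function name is hypothetical and the real assignment file may carry more columns.

```python
from collections import defaultdict


def split_by_insert_count(pairs):
    """Separate barcodes with one insert from those seen on several."""
    inserts = defaultdict(set)
    for barcode, insert in pairs:
        inserts[barcode].add(insert)
    unique = {bc: next(iter(s)) for bc, s in inserts.items() if len(s) == 1}
    ambiguous = {bc: sorted(s) for bc, s in inserts.items() if len(s) > 1}
    return unique, ambiguous


pairs = [("AAAC", "ins1"), ("AAAC", "ins2"), ("GGTT", "ins1"), ("GGTT", "ins1")]
unique, ambiguous = split_by_insert_count(pairs)
print(len(ambiguous), "barcode(s) on multiple inserts")
```

The `unique` mapping is what would be written back as the filtered assignment before the rerun.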

Refactoring MPRAsnakeflow - with sampling

MPRA-ENCODE comparison

Downsampling - Add upperlimit

Downsampling assignment

better handling of SLURM-related failures (low priority)

Currently, a pipeline failure on the script side (e.g. user cancellation) does not stop the SLURM processes, and vice versa (e.g. an out-of-memory error on SLURM leaves the script simply hanging).

Better handling could be introduced through:
https://stackoverflow.com/questions/52500725/snakemake-hangs-when-cluster-slurm-cancelled-a-job#59253812
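The approach discussed in that thread is a status script passed to Snakemake via `--cluster-status`, which must print `success`, `running` or `failed` for a given job ID. Below is a rough sketch under the assumption that `sacct` is available and that the Snakemake version in use still supports `--cluster-status`; the set of states treated as "running" is my own choice and may need adjusting for a given cluster.

```python
import subprocess
import sys

# SLURM states that mean the job is not finished yet (assumed list).
RUNNING_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING",
                  "SUSPENDED", "REQUEUED", "RESIZING"}


def snakemake_status(slurm_state):
    """Map a SLURM state string to the word Snakemake expects."""
    words = slurm_state.split()
    if not words:
        # sacct may not know the job yet; keep waiting.
        return "running"
    state = words[0].rstrip("+")  # e.g. "CANCELLED by 1000", "CANCELLED+"
    if state == "COMPLETED":
        return "success"
    if state in RUNNING_STATES:
        return "running"
    return "failed"  # CANCELLED, OUT_OF_MEMORY, TIMEOUT, FAILED, ...


def query_state(jobid):
    """Ask sacct for the state of a job (first line of the output)."""
    result = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__" and len(sys.argv) > 1:
    print(snakemake_status(query_state(sys.argv[1])))
```

With such a script, a cancelled or out-of-memory job is reported as failed instead of leaving the main process waiting.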

Add an option to use the reverse complement BC within the experiment workflow

Similar to the assignment workflow, implement BC_rev_comp: true.
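For illustration, the barcode handling behind such an option could look like the sketch below. The helper names are hypothetical; only the option name `BC_rev_comp` comes from the issue.

```python
# Translation table for DNA complements (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_barcode(barcode, bc_rev_comp=False):
    """Apply the reverse complement only when the option is enabled."""
    return reverse_complement(barcode) if bc_rev_comp else barcode


print(normalize_barcode("AACG", bc_rev_comp=True))
```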

Simplified config schemas

This config schema by @visze (unimplemented) looks better to me than the previous one, because it is more human-readable:

    assignments:
      october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
      october_nullomer2bc_fixed: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
    configs:
      noZeros:
        bc_threshold: 10
        minDNACounts: 1
        minRNACounts: 1

This may require detection of config vs file assignment automatically (by extension?)

Here's a mock-up of the current, more complicated schema:

    assignments:
      october_nullomer2bc:
        type: file
        assignment_file: resources/Nullomers/october_nullomer2bc.tsv.gz
      october_nullomer2bc_fixed:
        type: file
        assignment_file: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
    configs:
      noZeros:
        bc_threshold: 10
        DNA:
          min_counts: 1
        RNA:
          min_counts: 1

replace MergeTrimReadsBAM.py

try this:

Compare MPRA results with ENCODE (ATAC, DNAse, Histone-ChIP)

BCs of different length in assignment

filter them out
count how many
rerun with new assignment file
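The counting and filtering steps could be sketched like this, again assuming (barcode, insert) pairs and an expected barcode length such as the `bc_length: 15` used in the nullomers config. The function names are hypothetical.

```python
from collections import Counter


def length_histogram(pairs):
    """Count how many barcodes exist per barcode length."""
    return Counter(len(barcode) for barcode, _ in pairs)


def filter_by_length(pairs, bc_length):
    """Keep only assignments whose barcode has the expected length."""
    return [(bc, ins) for bc, ins in pairs if len(bc) == bc_length]


pairs = [("ACGTA", "ins1"), ("ACG", "ins2"), ("TTTTT", "ins3")]
print(length_histogram(pairs))
print(filter_by_length(pairs, bc_length=5))
```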

Compare ratios - DNA and RNA I (normal)

basic assignment workflow

Hello,
I'd like to run the MPRA snakeflow basic assignment workflow, and I'm running into this error with sra-tools for the tutorial dataset. Thanks for your help!

$ fastq-dump --gzip --split-files SRR10800986
2023-04-06T20:23:45 fastq-dump.2.8.0 err: item not found while constructing within virtual database module - the path 'SRR10800986' cannot be opened as database or table

Test run: TypeError: 'Series' objects are mutable, thus they cannot be hashed

In workflow/scripts/count/merge_label.py

TypeError: 'Series' objects are mutable, thus they cannot be hashed
[Fri Dec 17 02:35:10 2021]
Error in rule dna_rna_merge:
    jobid: 0
    output: results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz, results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
    conda-env: /fast/work/users/dashpm_c/MPRAsnakeflow/.snakemake/conda/6d7bef2d429def6aea33ad74edc041dc
    shell:
        
        python workflow/scripts/count/merge_label.py --counts results/test_basic_count/counts/merged/withoutZeros/HEPG2_3_merged_counts.tsv.gz         --minRNACounts 1 --minDNACounts 1         --assignment resources/SRR10800986_filtered_coords_to_barcodes.tsv.gz         --output results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz         --statistic results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Erroneous template instructions in README

https://github.com/kircherlab/MPRAsnakeflow/blob/master/README.md

The README.md for this workflow instructs users to "Create a new github repository using this workflow as a template.", but the option to use this repo as a template appears to be unavailable.

Split Downsampling (DNA, RNA)

Side - Parse bigWig files

LiftOver from GRCh37 -> GRCh38

Add Upperlimit feature - Downsampling

More filters in association steps?

A strand-sensitive mode would be nice (BWA in particular aligns to both strands equally by default)
Number of mismatches allowed

barcode distribution plot

Create additional barcode distribution plots that extend to the maximum number of counts, in addition to the existing ones with a maximum of 50.

introducing general configs

Currently there are multiple paths through the pipeline, each leading to different endpoint files.

The advantage is:

Fast computation, because nothing has to be rerun if we just want to add a new option (e.g. for downsampling)

Drawbacks:

Difficult to add new settings: this requires a complete refactoring of the workflow.
Reduced overview: we get many files of the same kind (e.g. a BC correlation file for multiple configs and settings).

Right now I can identify 4 different settings that are used in the path:

Assignment
Sampling
merging with/without zeros
config

Merging with zeros only has to be done when RNA or DNA counts are allowed to be zero, so it is in effect already determined by the config.

I think I would keep the assignment file separate because it is something external. We then have only 2 additional wildcards (besides condition, DNA/RNA, replicate): config and assignment.

Documentation of MPRAsnakeflow

Compare results - BC_threshold 1, 10, 50

simplify conda environments

except python 2.7

rename bc_threshhold to bc_threshold

config file uses misspelled parameter :-(

parallelization of association mapping step

Despite previous steps being split into (default) 300 fragments, mapping spawns only one samtools cat | bwa mem process (albeit multithreaded). It would be substantially faster if it spawned 300 processes. Would require a separate collection-sorting step.

Assignment workflow

The assignment workflow is missing in MPRAsnakeflow. We could implement the one from MPRAflow, but I think the one used for the MPRA-ML data might be nicer because it lets us adjust the assigned BCs in terms of multiple matches.

Documentation about strand-sensitive analysis

There are a lot of questions about analysing strand specificity, so it might be a good idea to add some documentation about this.

Technically, it is not possible to analyse strand specificity directly: reads cannot be mapped unambiguously to regions that are identical (in terms of forward, reverse, complement and reverse complement).

When analysing strand specificity, you have to add, by design, different nucleotides at the beginning and at the end so that mapping (BWA mem) is able to discriminate between the sequences. When using the restriction-enzyme-free approach, the two 15 bp adapters can be used for that, but they have to be in the design file and must not be trimmed from the reads.

Error in assignment workflow when using batch size of 1

In the rule assignment_collectBC, sort is used with the argument --batch-size=X, where X is the number of splits used in the assignment.
The minimum should be 2 for --batch-size, so sort fails if there is only one split.
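A simple guard would be to clamp the value to at least 2 before building the command. This is only a sketch: the helper names are hypothetical and the remaining arguments of the real `sort` call in assignment_collectBC are left out.

```python
def sort_batch_size(n_splits):
    """GNU sort requires --batch-size to be at least 2."""
    return max(2, n_splits)


def sort_option(n_splits):
    """Build the --batch-size argument for the sort call."""
    return f"--batch-size={sort_batch_size(n_splits)}"


print(sort_option(1))
print(sort_option(300))
```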
