
mprasnakeflow's People

Contributors

pyareedash, visze


mprasnakeflow's Issues

Run MPRAsnakeflow on nullomers

Run MPRAsnakeflow on nullomers to see the stats for the MPRA (especially the DNA).

input data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/resources/Nullomers

output data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/results/nullomers

Config is:

nullomers:
  bc_length: 15
  umi_length: 16
  data_folder: resources/Nullomers/data
  experiment_file: resources/Nullomers/experiment.csv
  demultiplex: False
  assignments:
    october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
  design_file: resources/Nullomers/design_file.fa
  configs:
    noZeros:
      bc_threshold: 10
      minDNACounts: 1
      minRNACounts: 1

No label file (maybe we can generate one with DISTAL, MID and PROXIMAL).

Base composition on barcodes

Before assignment, DNA and RNA show different complexities.

The average complexity is 6181380 for RNA and 2413583 for DNA, so DNA has roughly 40% of the RNA complexity.
(After assignment the ratio is about 1/2: 1028753 vs 2070490.)

The question is why the complexities differ so much. Two explanations came to mind:

  1. Sequencing errors in RNA
  2. Discarding of (the same pool of) DNA barcodes.

For the second option, a primer might extend into the barcode. We would then lose one base, which reduces the complexity. To find out whether that happens, we should look at the base composition at every position in the BCs, to see whether e.g. the end of the barcode contains only 3 of the 4 nucleotides.
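A minimal sketch of such a per-position check, assuming the barcodes are available as plain strings. The function names, the `min_fraction` cutoff and the toy barcodes are illustrative assumptions, not part of MPRAsnakeflow:

```python
from collections import Counter


def base_composition(barcodes):
    """Return one Counter of nucleotides per barcode position."""
    barcodes = list(barcodes)
    if not barcodes:
        return []
    length = max(len(bc) for bc in barcodes)
    counts = [Counter() for _ in range(length)]
    for bc in barcodes:
        for pos, base in enumerate(bc.upper()):
            counts[pos][base] += 1
    return counts


def depleted_positions(counts, alphabet="ACGT", min_fraction=0.05):
    """List (position, base) pairs whose frequency is below min_fraction.

    min_fraction is an arbitrary illustrative cutoff, not a tuned value.
    """
    flagged = []
    for pos, counter in enumerate(counts):
        total = sum(counter.values())
        for base in alphabet:
            if total and counter[base] / total < min_fraction:
                flagged.append((pos, base))
    return flagged


# Made-up toy barcodes: the last position (index 4) never contains "T".
toy_barcodes = ["ACGTA", "CGTAC", "GTACG", "TACGA", "ACGTC", "CATGG"]
counts = base_composition(toy_barcodes)
flagged = depleted_positions(counts)
print(flagged)
```

Running this separately on the DNA and the RNA barcodes would show whether a base is missing at one end in only one of the two libraries.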

Simplified config schemas

This config schema by @visze (unimplemented) looks better to me than the previous one, because it's more human-readable:

    assignments:
      october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
      october_nullomer2bc_fixed: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
    configs:
      noZeros:
        bc_threshold: 10
        minDNACounts: 1
        minRNACounts: 1

This may require automatic detection of whether an assignment entry is a config or a file (by extension?).

Here's a mock-up of the current, more complicated schema:

    assignments:
      october_nullomer2bc:
        type: file
        assignment_file: resources/Nullomers/october_nullomer2bc.tsv.gz
      october_nullomer2bc_fixed:
        type: file
        assignment_file: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
    configs:
      noZeros:
        bc_threshold: 10
        DNA:
          min_counts: 1
        RNA:
          min_counts: 1
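One possible way to support both forms is to expand the simplified schema into the explicit one when the config is loaded. The sketch below is only an illustration: it detects a file assignment by value type (a plain string) rather than by extension, and the function names are hypothetical, not existing MPRAsnakeflow code.

```python
def normalize_assignments(assignments):
    """Expand simplified assignment entries into the explicit schema.

    Assumption: a plain string is a path to an assignment file; a mapping
    is already in the explicit form and is copied unchanged.
    """
    normalized = {}
    for name, value in assignments.items():
        if isinstance(value, str):
            normalized[name] = {"type": "file", "assignment_file": value}
        else:
            normalized[name] = dict(value)
    return normalized


def normalize_filter_config(cfg):
    """Map flat minDNACounts/minRNACounts keys onto nested DNA/RNA entries."""
    out = {key: val for key, val in cfg.items()
           if key not in ("minDNACounts", "minRNACounts")}
    if "minDNACounts" in cfg:
        out["DNA"] = {"min_counts": cfg["minDNACounts"]}
    if "minRNACounts" in cfg:
        out["RNA"] = {"min_counts": cfg["minRNACounts"]}
    return out


simple = {
    "assignments": {
        "october_nullomer2bc": "resources/Nullomers/october_nullomer2bc.tsv.gz",
    },
    "configs": {
        "noZeros": {"bc_threshold": 10, "minDNACounts": 1, "minRNACounts": 1},
    },
}

explicit = {
    "assignments": normalize_assignments(simple["assignments"]),
    "configs": {name: normalize_filter_config(cfg)
                for name, cfg in simple["configs"].items()},
}
print(explicit)
```

With this approach the rest of the workflow only ever sees the explicit form, so the simplified schema could be added without touching the rules.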

basic assignment workflow

Hello,
I'd like to run the MPRAsnakeflow basic assignment workflow, and I'm running into this error with sra-tools for the tutorial dataset. Thanks for your help!

$ fastq-dump --gzip --split-files SRR10800986
2023-04-06T20:23:45 fastq-dump.2.8.0 err: item not found while constructing within virtual database module - the path 'SRR10800986' cannot be opened as database or table

Test run: TypeError: 'Series' objects are mutable, thus they cannot be hashed

In workflow/scripts/count/merge_label.py

TypeError: 'Series' objects are mutable, thus they cannot be hashed
[Fri Dec 17 02:35:10 2021]
Error in rule dna_rna_merge:
    jobid: 0
    output: results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz, results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
    conda-env: /fast/work/users/dashpm_c/MPRAsnakeflow/.snakemake/conda/6d7bef2d429def6aea33ad74edc041dc
    shell:
        
        python workflow/scripts/count/merge_label.py --counts results/test_basic_count/counts/merged/withoutZeros/HEPG2_3_merged_counts.tsv.gz         --minRNACounts 1 --minDNACounts 1         --assignment resources/SRR10800986_filtered_coords_to_barcodes.tsv.gz         --output results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz         --statistic results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

More filters in association steps?

  1. It would be nice to have a strand-sensitive mode (especially with BWA, which by default aligns to both strands equally)
  2. Number of mismatches allowed

barcode distribution plot

Create additional barcode distribution plots that go up to the maximum number of counts, in addition to the plot with a maximum of 50.

introducing general configs

Currently there are multiple paths through the pipeline, so we get different endpoint files.

The advantage is:

  • Fast computation, because nothing has to be rerun if we just want to add a new option (e.g. for downsampling)

Drawback:

  • Difficult to add new settings: each one requires a complete refactoring of the workflow
  • Reduced overview. We are getting many files of the same kind (e.g. BC correlation file for multiple configs and settings)

Right now I can identify 4 different settings that are used in the path:

  1. Assignment
  2. Sampling
  3. merging with/without zeros
  4. config

Merging with zeros only has to be done when RNA or DNA counts are allowed to be zero, so it is effectively already determined by the config.

I think I would keep the assignment file separate because it is something external. We would then have only 2 additional wildcards (besides condition, DNA/RNA, replicate): config and assignment.
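The dependency between the config and the zero-merging mode could be sketched as follows. The key names `minDNACounts` and `minRNACounts` come from the config shown in the nullomers issue; the function name and the default of 1 for a missing key are assumptions for illustration only.

```python
def needs_zero_merge(config):
    """Return True if counts must be merged with zeros for this config.

    Zeros are only needed when DNA or RNA counts are allowed to be zero,
    i.e. when one of the minimum-count thresholds is 0.
    Assumption: a missing threshold defaults to 1.
    """
    min_dna = config.get("minDNACounts", 1)
    min_rna = config.get("minRNACounts", 1)
    return min_dna == 0 or min_rna == 0


configs = {
    "noZeros": {"bc_threshold": 10, "minDNACounts": 1, "minRNACounts": 1},
    # Hypothetical second config that allows RNA dropouts.
    "allowRNAZeros": {"bc_threshold": 10, "minDNACounts": 1, "minRNACounts": 0},
}

merge_mode = {name: needs_zero_merge(cfg) for name, cfg in configs.items()}
print(merge_mode)
```

Deriving the mode this way would remove "merging with/without zeros" as an independent setting, leaving config and assignment as the only extra wildcards.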

parallelization of association mapping step

Despite previous steps being split into (default) 300 fragments, mapping spawns only one samtools cat | bwa mem process (albeit multithreaded). It would be substantially faster if it spawned 300 processes. This would require a separate collection and sorting step.

Assignment workflow

The assignment workflow is missing in MPRAsnakeflow. We can implement the one from MPRAflow, but I think the one used for the MPRA-ML data might be nicer because it lets us adjust the assigned BCs in terms of multiple matches.

Documentation about strand-sensitive analysis

There are a lot of questions about analysing strand specificity, so it might be a good idea to add some documentation about this.

Technically, it is not possible to analyse strand specificity, because reads cannot be mapped unambiguously to regions that are identical (in terms of forward, reverse, complement and reverse complement).

When analysing strand specificity, you have to add, by design, different nucleotides at the beginning and at the end so that mapping (BWA mem) is able to discriminate between the sequences. When using the restriction-enzyme-free approach, the two 15 bp adapters can be used for that, but they have to be in the design file and must not be trimmed from the reads.
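The effect of the flanking adapters can be illustrated with a small sketch. The insert and adapter sequences below are made up for illustration (they are not the real adapters), and the helper names are hypothetical:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def strand_ambiguous(seq_a, seq_b):
    """True if the two design sequences are reverse complements of each other.

    An aligner that searches both strands cannot tell such a pair apart.
    """
    return seq_a.upper() == reverse_complement(seq_b)


# Made-up 15 bp sequences, for illustration only.
insert = "ACGGTTCAGGCTAAC"
adapter_5 = "ACGTTGCAACGGTCA"
adapter_3 = "GGATCCTTAGCAGTC"

# Without adapters: the plus and minus designs are reverse complements.
plus_bare = insert
minus_bare = reverse_complement(insert)

# With adapters kept in the design: the flanks differ between orientations.
plus_flanked = adapter_5 + insert + adapter_3
minus_flanked = adapter_5 + reverse_complement(insert) + adapter_3

print(strand_ambiguous(plus_bare, minus_bare))        # ambiguous pair
print(strand_ambiguous(plus_flanked, minus_flanked))  # distinguishable pair
```

The flanked pair is only distinguishable as long as the adapters are present both in the design file and in the reads, which is why they must not be trimmed.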

Error in assignment workflow when using batch size of 1

In the rule assignment_collectBC, sort is used with the argument --batch-size=X, where X is the number of splits used in the assignment.
The minimum should be 2 for --batch-size, so sort fails if there is only one split.
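One possible fix is to clamp the value before it is passed to sort. This is a sketch with hypothetical helper names, not the actual code of the assignment_collectBC rule:

```python
def sort_batch_size(n_splits):
    """Return a --batch-size value that sort accepts.

    The minimum allowed value is 2, so a single split is clamped to 2.
    """
    return max(2, int(n_splits))


def build_sort_args(n_splits):
    """Build the sort argument list for a given number of splits."""
    return ["sort", f"--batch-size={sort_batch_size(n_splits)}"]


print(build_sort_args(1))
print(build_sort_args(300))
```

Clamping leaves the behaviour unchanged for any run with two or more splits.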
