kircherlab / mprasnakeflow
new implementation of MPRAsnakeflow fork of MPRAflow
License: MIT License
especially creating the log folder
also document that it is possible to run it in a different folder
this line is wrong
MPRAsnakeflow/workflow/rules/assignment.smk
Line 163 in 0c1f1fc
It should be 513. Otherwise reads that are not joined will also be mapped.
Run MPRAsnakeflow on nullomers to see the stats for the MPRA (especially the DNA).
input data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/resources/Nullomers
output data folder: /fast/groups/ag_kircher/MPRA/MPRAsnakeflow/results/nullomers
Config is:
nullomers:
  bc_length: 15
  umi_length: 16
  data_folder: resources/Nullomers/data
  experiment_file: resources/Nullomers/experiment.csv
  demultiplex: False
  assignments:
    october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
  design_file: resources/Nullomers/design_file.fa
  configs:
    noZeros:
      bc_threshhold: 10
      minDNACounts: 1
      minRNACounts: 1
No label file (maybe we can generate one with DISTAL, MID and PROXIMAL).
The script plot_perInsertCounts_correlation.R is not performant at all! On large data it takes more than 100GB of memory and runs for more than two days...
It might not be needed at all.
Otherwise check that it is in line with the reference from the assignment
Only copy (or symlink) when there is only 1 input!
Before assignment we get different complexities for DNA and RNA.
We get an average RNA complexity of 6181380 and an average DNA complexity of 2413583, so DNA has only about 39% of the RNA complexity.
(After assignment it is about 1/2: 1028753 vs. 2070490.)
The question is why the complexities differ so much. Two options came to mind:
For the second, it could be that a primer extends into a barcode. We would then lose one base and therefore reduce the complexity. To find out whether that happens, we should look at the base composition at every position in the BCs to see if, e.g., the end of the barcode contains only 3 of the 4 nucleotides.
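The per-position check described above could be sketched as follows. This is a minimal illustration, not part of the pipeline: the function names, the `min_fraction` threshold and the toy barcodes are all assumptions made up for the example.

```python
from collections import Counter


def base_composition(barcodes):
    """Return one Counter of nucleotides per barcode position."""
    if not barcodes:
        return []
    length = len(barcodes[0])
    counts = [Counter() for _ in range(length)]
    for bc in barcodes:
        if len(bc) != length:
            raise ValueError("all barcodes must have the same length")
        for pos, base in enumerate(bc.upper()):
            counts[pos][base] += 1
    return counts


def depleted_positions(counts, min_fraction=0.05):
    """Return 0-based positions where any of A, C, G, T is rarer than min_fraction.

    min_fraction is a hypothetical threshold chosen for illustration.
    """
    flagged = []
    for pos, counter in enumerate(counts):
        total = sum(counter.values())
        if any(counter[base] / total < min_fraction for base in "ACGT"):
            flagged.append(pos)
    return flagged


# Toy barcodes (made up): the last position never contains a T.
toy_barcodes = ["ACGTA", "CGTAC", "GTACG", "TACGA", "ACGTC", "CATGG"]
print(depleted_positions(base_composition(toy_barcodes)))
```

A position flagged at the very start or end of the barcode would be consistent with a primer reading into the barcode, though other causes are possible.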
Currently, a pipeline failure on the script side (e.g. user cancellation) does not stop the SLURM processes, and vice versa (e.g. an out-of-memory error leaves the script simply hanging).
Better handling could be introduced through:
https://stackoverflow.com/questions/52500725/snakemake-hangs-when-cluster-slurm-cancelled-a-job#59253812
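One possible approach is a job status script, assuming a Snakemake version that supports the `--cluster-status` option (such a script is expected to print `success`, `running` or `failed` for a given job ID). The sketch below is not taken from the repository. The function names and the grouping of SLURM states are assumptions, and `sacct` must be available on the submitting host.

```python
import subprocess
import sys

# SLURM states treated as "still in progress" (an assumed, non-exhaustive grouping).
RUNNING_STATES = {
    "PENDING", "CONFIGURING", "RUNNING", "COMPLETING",
    "SUSPENDED", "REQUEUED", "RESIZING",
}


def classify_state(raw_state):
    """Map a raw sacct State value to success / running / failed."""
    tokens = raw_state.split()
    # "CANCELLED by 1234" -> "CANCELLED"; a truncated "CANCELLED+" -> "CANCELLED"
    state = tokens[0].rstrip("+") if tokens else ""
    if state == "COMPLETED":
        return "success"
    if state == "" or state in RUNNING_STATES:
        # An empty state is treated as "accounting does not know the job yet".
        return "running"
    # Everything else (FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED, ...) is a failure.
    return "failed"


def query_state(job_id):
    """Ask sacct for the state of one job; returns '' if nothing is reported."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__" and len(sys.argv) == 2:
    print(classify_state(query_state(sys.argv[1])))
```

With such a script, an out-of-memory kill would be reported as `failed` instead of leaving the workflow waiting. Cancelling the SLURM jobs when the user aborts the pipeline is a separate problem that this sketch does not cover.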
similar to assignment implement BC_rev_comp: true
This config schema by @visze (unimplemented) looks better to me than previous one, because it's more human readable:
assignments:
  october_nullomer2bc: resources/Nullomers/october_nullomer2bc.tsv.gz
  october_nullomer2bc_fixed: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
configs:
  noZeros:
    bc_threshold: 10
    minDNACounts: 1
    minRNACounts: 1
This may require detecting config vs. file assignments automatically (by extension?).
Here's a mock-up of the current, more complicated schema:
assignments:
  october_nullomer2bc:
    type: file
    assignment_file: resources/Nullomers/october_nullomer2bc.tsv.gz
  october_nullomer2bc_fixed:
    type: file
    assignment_file: resources/Nullomers/october_nullomer2bc_correctLength_removedMultiAssignments.tsv.gz
configs:
  noZeros:
    bc_threshold: 10
    DNA:
      min_counts: 1
    RNA:
      min_counts: 1
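To illustrate how the shorter schema could be mapped onto the current one, here is a minimal sketch that expands plain path strings into `type: file` entries based on their extension. The function name, the list of accepted extensions and the decision to pass already-expanded entries through unchanged are assumptions for this example, not existing pipeline behaviour.

```python
# Extensions that are assumed to mark an assignment file (hypothetical list).
ASSIGNMENT_FILE_EXTENSIONS = (".tsv", ".tsv.gz")


def normalize_assignments(assignments):
    """Expand short-form entries (name: path) into the long form.

    Entries that are already mappings are copied unchanged, so both
    schemas could coexist in one config file.
    """
    normalized = {}
    for name, value in assignments.items():
        if isinstance(value, dict):
            normalized[name] = dict(value)
        elif isinstance(value, str) and value.endswith(ASSIGNMENT_FILE_EXTENSIONS):
            normalized[name] = {"type": "file", "assignment_file": value}
        else:
            raise ValueError(
                "cannot detect assignment type for {!r}: {!r}".format(name, value)
            )
    return normalized


short_form = {
    "october_nullomer2bc": "resources/Nullomers/october_nullomer2bc.tsv.gz",
}
print(normalize_assignments(short_form))
```

Raising an error for unrecognized values keeps the detection explicit. Whether a non-file string should instead be interpreted as a config-type assignment is left open here.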
Hello,
I'd like to run the MPRA snakeflow basic assignment workflow, and I'm running into this error with sra-tools for the tutorial dataset. Thanks for your help!
$ fastq-dump --gzip --split-files SRR10800986
2023-04-06T20:23:45 fastq-dump.2.8.0 err: item not found while constructing within virtual database module - the path 'SRR10800986' cannot be opened as database or table
In workflow/scripts/count/merge_label.py
TypeError: 'Series' objects are mutable, thus they cannot be hashed
[Fri Dec 17 02:35:10 2021]
Error in rule dna_rna_merge:
jobid: 0
output: results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz, results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
conda-env: /fast/work/users/dashpm_c/MPRAsnakeflow/.snakemake/conda/6d7bef2d429def6aea33ad74edc041dc
shell:
python workflow/scripts/count/merge_label.py --counts results/test_basic_count/counts/merged/withoutZeros/HEPG2_3_merged_counts.tsv.gz --minRNACounts 1 --minDNACounts 1 --assignment resources/SRR10800986_filtered_coords_to_barcodes.tsv.gz --output results/test_basic_count/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.tsv.gz --statistic results/test_basic_count/stats/assigned_counts/standard/threshold50/HEPG2_3_merged_assigned_counts.statistic.tsv.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
https://github.com/kircherlab/MPRAsnakeflow/blob/master/README.md
The README.md for this workflow instructs users to "Create a new github repository using this workflow as a template." while it appears that the option to use this repo as a template is unavailable
create additional barcode distribution plots on the maximum number of counts in addition to maximum of 50
Actually, there are multiple paths we can take through the pipeline, so we get different endpoint files.
The advantage is:
Drawback:
Right now I can identify 4 different settings that are used in the path:
The merging with zeros only has to be done when RNA or DNA is allowed to be zero, so it is effectively already set by the config.
I think I would keep the assignment file separate because it is something external. We would then have only 2 additional wildcards (besides condition, DNA/RNA, replicate): config and assignment.
config file uses misspelled parameter :-(
Despite previous steps being split into (default) 300 fragments, mapping spawns only one samtools cat | bwa mem process (albeit multithreaded). It would be substantially faster if it spawned 300 processes. This would require a separate collection-sorting step.
Assignment workflow is missing in MPRAsnakeflow. We can implement the one of MPRAflow. But I think the one used for the MPRA-ML data might be nicer because we can adjust the assigned BCs in terms of multiple matches.
There are a lot of questions about analysing strand specificity, so it might be a good idea to add some documentation about this.
Technically it is not possible to analyse strand specificity out of the box. Mapping cannot discriminate between regions that are identical (in terms of forward, reverse, complement and reverse complement).
When analysing strand specificity you have to add, by design, different nucleotides at the beginning and at the end so that mapping (BWA mem) is able to discriminate between the sequences. When using the restriction-enzyme-free approach, the two 15bp adapters can be used for that. But they have to be in the design file, and these adapters must not be trimmed from the reads.
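The effect of the flanking adapters can be illustrated with a small check on a design. This is a sketch under stated assumptions: the function names, the toy sequences and the short adapters are made up (real adapters would be the two 15bp ones), and exact string comparison only approximates what BWA mem can or cannot discriminate.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_ambiguous(design):
    """Return names of sequences whose reverse complement is also in the design.

    This includes sequences that are their own reverse complement.
    """
    present = {seq.upper() for seq in design.values()}
    return sorted(
        name
        for name, seq in design.items()
        if reverse_complement(seq.upper()) in present
    )


# Toy design (made up): the same element in both orientations.
design = {"enh1_fwd": "AAACCC", "enh1_rev": "GGGTTT"}

# Hypothetical short adapters; the right one is not the reverse
# complement of the left one, which is what breaks the symmetry.
LEFT, RIGHT = "ACGA", "TTCC"
flanked = {name: LEFT + seq + RIGHT for name, seq in design.items()}

print(strand_ambiguous(design))
print(strand_ambiguous(flanked))
```

Without adapters both orientations are indistinguishable. With distinct adapters kept in the design file, each orientation becomes a unique sequence.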
In the rule assignment_collectBC, sort is used with the argument --batch-size=X, where X is the number of splits used in the assignment.
The minimum should be 2 for --batch-size, so sort fails if there is only one split.
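A possible fix is to clamp the value before it is passed to sort. The helper below is a minimal sketch: the function names are hypothetical, and how the value reaches the rule (for example through a params entry) is not shown.

```python
def sort_batch_size(n_splits):
    """Return a --batch-size value that sort accepts (at least 2)."""
    return max(2, int(n_splits))


def batch_size_flag(n_splits):
    """Build the command-line flag for a given number of splits."""
    return "--batch-size={}".format(sort_batch_size(n_splits))


print(batch_size_flag(1))
print(batch_size_flag(300))
```

With a single split the flag stays valid, and for larger split counts the behaviour is unchanged.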