engreitzlab / variant-flowfish


Analysis for Variant-FlowFISH experiments

License: GNU General Public License v3.0

Python 56.00% Shell 2.10% R 41.90%

variant-flowfish's Issues

make_count_tables is picking the first reference_sequence arbitrarily

https://github.com/EngreitzLab/variant-flowfish/blob/main/workflow/rules/make_count_tables.smk#L27

It should pick the most frequent reference sequence (presumably the original intent), or otherwise account for all of the unique reference sequences.
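A minimal sketch of the "most frequent" option, assuming the allele table is a pandas DataFrame with a `Reference_Sequence` column and a per-row read count column. The count column name `#Reads` and the helper name `most_frequent_reference` are assumptions for illustration, not taken from the repository.

```python
import pandas as pd

def most_frequent_reference(allele_tbl: pd.DataFrame, count_col: str = "#Reads") -> str:
    """Return the Reference_Sequence supported by the largest number of reads."""
    # Sum read counts per distinct reference sequence, then take the largest.
    reads_per_ref = allele_tbl.groupby("Reference_Sequence")[count_col].sum()
    return reads_per_ref.idxmax()

# Toy allele table: two distinct reference strings with different read support.
allele_tbl = pd.DataFrame({
    "Reference_Sequence": ["ACGT", "ACGT", "AC-GT"],
    "#Reads": [10, 5, 12],
})
ref_seq = most_frequent_reference(allele_tbl)
```

Unlike taking `.values[0]`, this does not depend on row order, and it weights each reference by reads rather than by number of rows.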

make sure count table is sorted so that trim_count_table rule works

The trim_count_table rule takes the first N lines of the count table file. I don't think the make_count_tables function currently sorts its output, so we should make sure that it does.
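One way to guarantee the ordering, sketched under the assumption that the count table is a pandas DataFrame with one column per bin and that "top N" means highest total read count. The column names and the helper `sort_count_table` are hypothetical.

```python
import pandas as pd

def sort_count_table(count_tbl: pd.DataFrame, count_cols: list) -> pd.DataFrame:
    """Order rows by total reads across count_cols, most abundant first."""
    total_reads = count_tbl[count_cols].sum(axis=1)
    # A stable sort keeps the original order among rows with equal totals.
    order = total_reads.sort_values(ascending=False, kind="stable").index
    return count_tbl.loc[order].reset_index(drop=True)

# Toy count table with two bins (column names are placeholders).
example = pd.DataFrame({
    "MappingSequence": ["AAA", "CCC", "GGG"],
    "binA": [1, 50, 7],
    "binB": [2, 40, 3],
})
sorted_tbl = sort_count_table(example, ["binA", "binB"])
```

With the table written in this order, taking the head N lines downstream keeps the N most abundant alleles.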

make_count_tables refactor

Outline:
- For each sample:
  - Match variants with .str.contains() and remove the matching reads from the sample.
  - From the remaining reads, choose the most frequent refAllele as the "true" reference and quantify its mismatches against Aligned_Sequence.
- Collect variants and references from each sample, aggregate them, and create the count tables.

Todos:
- Add the mismatch function from @engreitz.
- Plot mismatches in the reference.
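The per-sample steps in the outline could look roughly like the sketch below. The mismatch counter here is a plain position-by-position comparison used as a stand-in, not the function from @engreitz, and the helper names are hypothetical. It assumes aligned and reference strings have equal length (gap characters included).

```python
import pandas as pd

def split_variant_reads(sample_tbl: pd.DataFrame, mapping_seq: str):
    """Split a sample's allele table into reads that contain mapping_seq and the rest."""
    # regex=False treats the sequence as a literal substring.
    is_variant = sample_tbl["Aligned_Sequence"].str.contains(mapping_seq, regex=False)
    return sample_tbl[is_variant], sample_tbl[~is_variant]

def count_mismatches(aligned: str, reference: str) -> int:
    """Count positions where two equal-length aligned strings differ."""
    if len(aligned) != len(reference):
        raise ValueError("aligned and reference must have the same length")
    return sum(a != r for a, r in zip(aligned, reference))

# Toy sample: one read carries the (made-up) variant sequence "GATG".
sample_tbl = pd.DataFrame({
    "Aligned_Sequence": ["ACGTTGCA", "ACGATGCA", "ACGTTGCC"],
    "#Reads": [5, 3, 2],
})
variant_reads, remaining = split_variant_reads(sample_tbl, "GATG")
n_mismatch = count_mismatches("ACGTTGCC", "ACGTTGCA")
```

The `remaining` table is what the second sub-step would use to pick the most frequent reference and to tally mismatches for plotting.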

write rule to run alignment of sample sequences to PhiX with bowtie2

PhiX FASTA: https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1?report=fasta
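As a starting point for the rule, the two bowtie2 invocations can be sketched as Python helpers that only build the command lines. The file names are placeholders, and unpaired reads are assumed; how this is wired into a Snakemake rule is left open.

```python
def bowtie2_build_cmd(fasta: str, index_prefix: str) -> list:
    """Command to build a bowtie2 index from the PhiX FASTA."""
    return ["bowtie2-build", fasta, index_prefix]

def bowtie2_align_cmd(index_prefix: str, fastq: str, sam_out: str, threads: int = 1) -> list:
    """Command to align unpaired reads (-U) to the index (-x), writing SAM (-S)."""
    return ["bowtie2", "-p", str(threads), "-x", index_prefix, "-U", fastq, "-S", sam_out]

# Placeholder paths, for illustration only.
build_cmd = bowtie2_build_cmd("phix.fasta", "phix_index")
align_cmd = bowtie2_align_cmd("phix_index", "sample.fastq.gz", "sample.phix.sam", threads=4)
```

Each list can be passed to `subprocess.run(cmd, check=True)`, or joined into the `shell:` string of a rule. bowtie2 prints its alignment summary to stderr, so redirecting stderr to a log file would capture the PhiX alignment rate per sample.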

move variant matching to before MLE (maybe in make_count_tables rule)

Variant matching to MappingSequence currently occurs in normalize_allele_effects.py, after MLE. This should move to the count table creation step.

add reference alignment error threshold in AmpliconInfo.tsv and read into aggregate_variants function
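A rough sketch of reading a per-amplicon threshold, assuming AmpliconInfo.tsv is tab-separated with one row per amplicon. The column names `AmpliconID` and `ReferenceErrorThreshold`, the default value, and the helper name are all hypothetical; the real file layout may differ.

```python
import io
import pandas as pd

# Stand-in for AmpliconInfo.tsv; both column names are hypothetical.
AMPLICON_INFO_TSV = "AmpliconID\tReferenceErrorThreshold\nampA\t0.05\nampB\t0.10\n"

def load_error_thresholds(handle, default: float = 0.05) -> dict:
    """Map each amplicon to its reference alignment error threshold."""
    info = pd.read_csv(handle, sep="\t")
    thresholds = info.set_index("AmpliconID")["ReferenceErrorThreshold"]
    # Fall back to a default where the column is left blank.
    return thresholds.fillna(default).to_dict()

thresholds = load_error_thresholds(io.StringIO(AMPLICON_INFO_TSV))
```

The resulting dictionary could then be passed to aggregate_variants, which would look up the threshold for the amplicon it is processing.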

fix make_count_tables.smk

In https://github.com/EngreitzLab/variant-flowfish/blob/main/workflow/rules/make_count_tables.smk:

In this line -

variant-flowfish/workflow/rules/make_count_tables.smk

Line 27 in 1212739

ref_seq = allele_tbl.loc[allele_tbl['Aligned_Sequence'] == allele_tbl['Reference_Sequence'], 'Reference_Sequence'].values[0]

Is this supposed to be an exact match between Aligned_Sequence and Reference_Sequence? In tests there don't seem to be any matches, which breaks the code.
- From @lampburglar: If our amplicon is ~250 nucleotides long, then with a 1% per-base error rate the probability of a read having no substitutions is 0.99^250, or ~8%. That suggests we should still be able to find an exact match among the ~100K reads we expect. There must be some unstable region in the amplicon that yields lots of background error.
- I am rerunning CRISPResso and the rest of the pipeline. We were thinking that we can check only the region where the edit is supposed to be; if that region is an exact match, the sequence can be considered an unedited reference sequence.
separate out functions as another script in scripts/
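The arithmetic in the comment above, and the proposed region-only check, can be sketched as follows. The 1% per-base error rate, the 250 nt length, and the 100K read depth come from the discussion; the window coordinates and the helper name are hypothetical, and the indices are assumed to refer to alignment columns.

```python
per_base_accuracy = 0.99   # 1% per-base substitution rate, as in the discussion
amplicon_length = 250      # nucleotides
n_reads = 100_000

# Probability that a read has no substitutions anywhere in the amplicon.
p_error_free = per_base_accuracy ** amplicon_length
# Expected number of exact-match reads at this depth (roughly 8% of reads).
expected_exact_reads = n_reads * p_error_free

def is_unedited(aligned: str, reference: str, start: int, end: int) -> bool:
    """Treat a read as unedited if it matches the reference within [start, end)."""
    return aligned[start:end] == reference[start:end]

# Toy check: a substitution outside the edit window does not disqualify the read.
reference = "ACGTACGTAC"
read_with_flank_error = "TCGTACGTAC"
unedited = is_unedited(read_with_flank_error, reference, start=3, end=8)
```

Under these assumptions thousands of exact matches would be expected, so finding none points to a systematic difference rather than random sequencing error. Restricting the comparison to the edit window sidesteps that.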
