ctDNA pipeline

for ctDNA analysis

manual

perl pipeline/pip.clean.pl
-in input data path
-out output path
-target target BED file (default $bin/../data_base/first.panel.bed)
-fa reference FASTA (default /data1/software/b37/human_g1k_v37.fasta)
-db database directory (default /data1/software/b37)
-anno ANNOVAR database (default /data1/software/annovar/)
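
The option list above can be turned into an argument list programmatically. The following is a minimal sketch, not part of the pipeline: `build_command` is a hypothetical helper, and it assumes that any option left out falls back to the default listed above inside the Perl script itself.

```python
# Hypothetical helper: builds the argument list for pipeline/pip.clean.pl.
# Only -in and -out are always passed; the optional flags are added only
# when overridden, so the script's own defaults apply otherwise.
OPTIONAL_FLAGS = ("target", "fa", "db", "anno")


def build_command(inpath, outpath, script="pipeline/pip.clean.pl", **overrides):
    unknown = set(overrides) - set(OPTIONAL_FLAGS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    cmd = ["perl", script, "-in", str(inpath), "-out", str(outpath)]
    for name in OPTIONAL_FLAGS:
        if name in overrides:
            cmd += [f"-{name}", str(overrides[name])]
    return cmd


if __name__ == "__main__":
    # The paths here are placeholders, not real data locations.
    print(" ".join(build_command("inpath", "outpath", fa="my_reference.fasta")))
```

The resulting list can be handed to `subprocess.run` without going through a shell.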

pipeline description

flowchart

digraph G {
  compound=true;
  CleanData -> Chemical;
  Chemical [label = "get CHEMICAL medicine type"];
  CleanData -> PEmap;
  PEmap [label = "paired reads mapping"];
  raw->mapping->bam_filter->var_calling->var_filter->result[arrowhead=none];
  PEmap -> DEunpair;
  Rmdup1 [label = "rmdup for fusion"];
  Realign [label = "recal && realign"];
  Target [label = "get targeted bam"];
  DEunpair [label = "delete unmapped or \n unpaired reads"];
  FilterOther[fillcolor=yellow, style="rounded,filled", shape=box];
  Filter[fillcolor=yellow, style="rounded,filled", shape=box];
  STAT[fillcolor=yellow, style="rounded,filled", shape=box];
  Filter1[fillcolor=yellow, style="rounded,filled", shape=box];
  Filter2[fillcolor=yellow, style="rounded,filled", shape=box];
  Filter_fusion1[fillcolor=yellow, style="rounded,filled", shape=box];
  ANNO_fusion[shape=box];
  CHEMICAL_medicine_Type[fillcolor=yellow, style="rounded,filled", shape=box];
  RmXA [label = "rm Multiple alignment"];
  FilterOther [label ="filter reads with no MD or CIGAR\nfilter reads with too many mismatches:\ninsertions and deletions >= 3\nmismatches of all types > 5\nsoft clips and 5bp read borders are not counted"];
  MergeDP [label = "merge business and control depth"];
  Filter [label = "filter cnv by gene:\nmore than 70% base of one gene > 0.7\nbase of one gene detected base > 500"];
  Chrcalling [label = "calling each chr snv by mutect2"];
  Merge [label = "merge vcf"];
  Add [label ="add cosmic and rs data"];
  STAT [label = "count high-quality mismatches:\nmapping quality >= 30\ninsertions and deletions < 2\nmismatches of all types <= 3\ndistance to read border > 5bp\nbase quality >= 20 or 80% of the nearby 10bp with base quality >= 20"];
  Filter1 [label = "filter low depth:\ntotal depth at the base >= 30\ndepth of alt >= 5\nalt ratio >= 0.002"];
  Filter2 [label = "keep high-quality mismatches:\n1. keep SNVs supported by >= 5 reads of different types\n with >= 1 forward read and >= 1 reverse read\n2. indels with no high-quality read support are tagged as not filtered\n3. SNPs with no high-quality read support are tagged as filtered"];
  FUSION [label = "CREST detect SV raw result"];
  Filter_fusion1 [label = "filter:\none or more percent_identity < 0.9\none or more depth < 4\nboth side out of target +-50bp"];
  POS [label = "get the position of both split sides"];
  ANNO_fusion [label = "annovar for both split positions\n to get gene information"];
  merge_result [label = "merge the annovar result into the sv result"];
  subgraph cluster0 {
    label = "mapping bwa";
    DEunpair -> Sorted;
    Sorted -> Target;
    Target -> Realign;
    Sorted -> Rmdup1;
    Realign -> Rmdup;
    Rmdup1 -> "sample.rmdup.bam"[splines="line"];
    Realign -> "sample.target.realigned.rmxa.bam";
    {rank=same;  Rmdup, Realign};
  }
  subgraph cluster1 {
    label = "bam filter";
    RmXA -> FilterOther;
    FilterOther -> "sample.target.realigned.removemultiple.bam";
  }
  subgraph cluster3{
    label = "CNV calling";
    "sample.target.bam.depth" -> MergeDP;
    MergeDP -> CNVcalling;
    CNVcalling -> Filter;
  }
  subgraph cluster4{
    label = "SNV calling";
    Chrcalling -> Merge;
    Merge -> Add;
    Add -> ANNO;
  }
  subgraph cluster5{
    label = "somatic SNV filter";
    STAT -> Filter1;
    Filter1 -> Filter2;
    Filter2 -> "sample.somtic.new.xlsx";
  }
  subgraph cluster6{
    label = "FUSION detect by CREST";
    FUSION -> Filter_fusion1;
    Filter_fusion1 -> POS;
    POS -> ANNO_fusion;
    ANNO_fusion -> merge_result;
  }
  splines=ortho;
  CHEMICAL_medicine_Type [label = "CHEMICAL medicine Type:\nhom_alt alt_ratio >= 0.8\nhet alt_ratio >= 0.3\nother hom_ref\nSSRs are not detected;\n one of them is typed as an indel"];
  "sample.rmdup.bam" -> FUSION;
  "sample.target.realigned.rmxa.bam" -> RmXA;
  "sample.target.realigned.removemultiple.bam" -> Chrcalling;
  Target -> "sample.target.bam.depth";
  ANNO -> STAT;
  Chemical -> CHEMICAL_medicine_Type;
  Merge -> CHEMICAL_medicine_Type;
  CHEMICAL_medicine_Type -> "sample.CHEMICAL.medicine.xlsx";
  merge_result -> "fusion.result.xlsx";
  Filter -> "cnv.xlsx";
  "cnv.xlsx"[fillcolor=green, style="rounded,filled", shape=box];
  "sample.somtic.new.xlsx"[fillcolor=green, style="rounded,filled", shape=box];
  "sample.CHEMICAL.medicine.xlsx"[fillcolor=green, style="rounded,filled", shape=box];
  "fusion.result.xlsx"[fillcolor=green, style="rounded,filled", shape=box];
  "sample.target.realigned.rmxa.bam"[fillcolor=pink, style="filled",shape=box];
  "sample.target.realigned.removemultiple.bam"[fillcolor=pink, style="filled",shape=box];
  "sample.target.bam.depth"[fillcolor=pink, style="filled",shape=box];
  "sample.rmdup.bam"[fillcolor=pink, style="filled",shape=box];
  { rank=same; raw; CleanData;}
  { rank=same; bam_filter; "sample.target.realigned.rmxa.bam", "sample.target.bam.depth";}
  { rank=same; var_calling; "sample.rmdup.bam","sample.target.realigned.removemultiple.bam";}
  { rank=same; var_filter;  FUSION;}
  { rank=same; result; "sample.somtic.new.xlsx","sample.CHEMICAL.medicine.xlsx","fusion.result.xlsx","cnv.xlsx";}
}

get CHEMICAL medicine type

  • 1 for personalized cancer medication
  • 2 for lung cancer
  • 3 for colorectal cancer

mapping

bwa

  1. map paired reads
  2. delete unmapped reads and unpaired reads
  3. sort the bam
  4. get the targeted bam, used for snp, indel and cnv calling
  5. recal
  6. realign
  7. rmdup
    • after step 6: only to report duplication statistics
    • after step 3: for fusion detection
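
Step 2 can be expressed as a test on the SAM FLAG field. The sketch below uses the flag bits defined in the SAM specification; `is_mapped_pair` is a hypothetical name, and reading "unpaired" as "not paired, or mate unmapped" is an assumption about what the pipeline does.

```python
# SAM FLAG bits (from the SAM specification).
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # mate is unmapped


def is_mapped_pair(flag: int) -> bool:
    """Return True if the read should be kept under the assumed rule:
    paired, mapped, and with a mapped mate."""
    if not flag & PAIRED:
        return False
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False
    return True


if __name__ == "__main__":
    for flag in (99, 77, 73, 0):
        print(flag, is_mapped_pair(flag))
```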

filter

for snp, indel, cnv calling

  1. remove reads with multiple alignments
  2. remove reads with no MD or CIGAR
  3. remove reads with mapping quality < 30
  4. remove reads with too many mismatches (too many mismatches in one read may be caused by wrong mapping)
    • insertions and deletions >= 3
    • mismatches of all types > 5
    • soft clips and the 5bp read borders are not counted
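
The read-level rules above can be sketched on the CIGAR string and MD tag of one alignment. This is an illustration under stated assumptions, not the pipeline's code: `passes_bam_filter` is a hypothetical name, indels are counted as CIGAR events rather than bases, "all types" is taken to mean substitutions plus indel events, multiple alignment is represented by a flag for the presence of an XA tag, and the exclusion of soft clips and 5bp read borders is not modeled.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# MD tokens: a deletion (^ACGT), a single mismatched base, or a match length.
MD_TOKEN = re.compile(r"\^[A-Za-z]+|[A-Za-z]|\d+")


def count_indels(cigar: str) -> int:
    """Number of insertion and deletion events in a CIGAR string."""
    return sum(1 for _, op in CIGAR_OP.findall(cigar) if op in "ID")


def count_substitutions(md: str) -> int:
    """Number of mismatched bases in an MD tag (deletions are skipped)."""
    return sum(1 for tok in MD_TOKEN.findall(md) if tok.isalpha())


def passes_bam_filter(cigar, md, mapq, has_xa) -> bool:
    if has_xa:                                 # multiple alignment
        return False
    if not cigar or cigar == "*" or not md:    # no MD or CIGAR
        return False
    if mapq < 30:                              # low mapping quality
        return False
    indels = count_indels(cigar)
    if indels >= 3:
        return False
    if indels + count_substitutions(md) > 5:   # assumed meaning of "all types"
        return False
    return True
```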

bam stat

  1. mapped reads
  2. target ratio (evaluates the capture efficiency of the panel)
  3. depth of the target region
  4. coverage (coverage at different depths shows whether the data can be used for snv calling)
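
As a rough illustration of these statistics, the sketch below computes the target ratio, the mean target depth, and the fraction of target bases covered at several depth thresholds. The function names and the thresholds are hypothetical, and it is assumed that per-base depths of the target region are already available as a list.

```python
def target_ratio(on_target_reads: int, mapped_reads: int) -> float:
    """Fraction of mapped reads that fall in the target region."""
    return on_target_reads / mapped_reads if mapped_reads else 0.0


def depth_summary(depths, thresholds=(1, 30, 100, 500)):
    """Mean depth and the fraction of bases with depth >= each threshold."""
    n = len(depths)
    if n == 0:
        return {"mean_depth": 0.0, "coverage": {t: 0.0 for t in thresholds}}
    coverage = {t: sum(1 for d in depths if d >= t) / n for t in thresholds}
    return {"mean_depth": sum(depths) / n, "coverage": coverage}


if __name__ == "__main__":
    print(target_ratio(80, 100))
    print(depth_summary([0, 10, 40, 200]))
```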

CNV

CNV calling

  1. get the depth of the target region
  2. merge business and control depth
  3. call cnv for each base

CNV result

  1. filter
    • more than 70% of the bases of one gene > 0.7 (the gene is the unit of detection)
    • bases of one gene > 500 (genes with too few bases give too many false positives)
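
The gene-level rule can be sketched as follows. The text does not say what the per-base value compared against 0.7 is, so the sketch simply takes a list of per-base CNV values for one gene; `gene_passes_cnv_filter` and its parameter names are hypothetical, and treating both conditions as requirements for keeping a gene is an assumption.

```python
def gene_passes_cnv_filter(values, value_cutoff=0.7, min_fraction=0.7, min_bases=500):
    """Keep a gene only if it has more than `min_bases` detected bases and
    more than `min_fraction` of them exceed `value_cutoff`."""
    n = len(values)
    if n <= min_bases:
        return False
    above = sum(1 for v in values if v > value_cutoff)
    return above / n > min_fraction


if __name__ == "__main__":
    print(gene_passes_cnv_filter([1.0] * 600))               # enough bases, all above cutoff
    print(gene_passes_cnv_filter([1.0] * 300 + [0.0] * 300)) # only half above cutoff
```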

SNV

SNV calling

  1. call snv for each chr with mutect2 (this both saves time and avoids a GATK bug that appears on long runs)
  2. merge vcf
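
The merge step can be pictured as concatenating the per-chromosome VCF files under a single header. This is a simplified sketch, not the pipeline's implementation: `merge_vcfs` is a hypothetical helper that works on VCF text already read into memory, and it assumes that the inputs are given in chromosome order and share the same header.

```python
def merge_vcfs(vcf_texts):
    """Concatenate per-chromosome VCF texts, keeping the header of the first."""
    header, records = [], []
    for index, text in enumerate(vcf_texts):
        for line in text.splitlines():
            if not line:
                continue
            if line.startswith("#"):
                if index == 0:          # header lines come from the first file only
                    header.append(line)
            else:
                records.append(line)
    return "\n".join(header + records) + "\n"


if __name__ == "__main__":
    chr1 = "##fileformat=VCFv4.2\n#CHROM\tPOS\n1\t100\n"
    chr2 = "##fileformat=VCFv4.2\n#CHROM\tPOS\n2\t200\n"
    print(merge_vcfs([chr1, chr2]), end="")
```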

ANNO

  1. add cosmic and rs data
  2. annovar

somatic snv filter

  1. count the high-quality mismatches (preparation for filtering false positives)

    • mapping quality >= 30
    • insertions and deletions < 2
    • mismatches of all types <= 3
    • distance to the read border > 5bp
    • base quality >= 20 or 80% of the nearby 10bp with base quality >= 20
  2. filter low depth (low depth or ratio generally means false positives)

    • total depth at the base >= 30
    • depth of alt >= 5
    • alt ratio >= 0.002
  3. filter by high-quality reads (rmdup is not done, so this guards against snvs that originate from PCR)

    • get the high-quality mismatches
    • keep the snv supported by >= 5 reads of different types, with >= 1 forward read and >= 1 reverse read
    • indels with no high-quality read support are tagged as not filtered (because mutect2 does clustering and stitching, the position may have changed)
    • snps with no high-quality read support are tagged as filtered
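
Steps 2 and 3 can be sketched as one decision function. The `Candidate` fields and the `classify` name are hypothetical, and the high-quality read counts are assumed to come from step 1 with duplicate-like reads already collapsed. The text does not say how an indel with one to four high-quality reads is handled; the sketch treats it like an snv, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    is_indel: bool
    depth: int        # total depth at the site
    alt_depth: int    # reads carrying the alt allele
    hq_forward: int   # distinct high-quality alt reads, forward strand
    hq_reverse: int   # distinct high-quality alt reads, reverse strand


def classify(c: Candidate) -> str:
    # Step 2: depth filter.
    if c.depth < 30 or c.alt_depth < 5:
        return "filtered"
    if c.alt_depth / c.depth < 0.002:
        return "filtered"
    # Step 3: support from high-quality reads.
    hq_total = c.hq_forward + c.hq_reverse
    if hq_total == 0:
        # Indels may have been repositioned by mutect2, so they are kept.
        return "pass" if c.is_indel else "filtered"
    if hq_total >= 5 and c.hq_forward >= 1 and c.hq_reverse >= 1:
        return "pass"
    return "filtered"
```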

CHEMICAL medicine

  1. hom_alt: alt_ratio >= 0.8 (homozygous mutation)
  2. het: 0.8 > alt_ratio >= 0.3 (heterozygous mutation)
  3. other: hom_ref (homozygous reference)
  4. SSRs are not detected; one of them is typed as an indel
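
The three genotype thresholds map directly onto a small function. `genotype_from_alt_ratio` is a hypothetical name; the cutoffs are the ones listed above, and the SSR case is not covered.

```python
def genotype_from_alt_ratio(alt_ratio: float) -> str:
    """Type a site from the fraction of reads carrying the alt allele."""
    if alt_ratio >= 0.8:
        return "hom_alt"
    if alt_ratio >= 0.3:
        return "het"
    return "hom_ref"


if __name__ == "__main__":
    for ratio in (0.95, 0.5, 0.1):
        print(ratio, genotype_from_alt_ratio(ratio))
```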

FUSION

  1. SV detection by CREST
  2. filter
    • one or more percent_identity < 0.9 (maybe a wrongly mapped position)
    • one or more depth < 4 (maybe wrong, or the ratio is very low)
    • both sides out of target +-50bp (one side must be in the target, with start and end expanded by 50bp)
  3. anno (the anno software cannot annotate variation between different chr)
    • get the position of both sides
    • annovar for both sides
    • merge the annovar result into the sv result
  4. anno filter
    • filtered-intronic-same (just SV, not gene fusion)
    • filtered-low_coverage: both sides < 300 (one side must be in the target, with start and end expanded by 50bp, and depth >= 300)
    • filtered-intergenic (just SV, not gene fusion)
    • filtered-strand (just SV, not gene fusion)
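
The raw filter in step 2 can be sketched as below. It is not CREST's output format: each breakpoint side is a plain dictionary with hypothetical keys, targets are `(chrom, start, end)` tuples treated as inclusive intervals for simplicity, and the annotation filter of step 4 is not modeled.

```python
def in_target(chrom, pos, targets, pad=50):
    """True if pos lies inside any target on chrom, expanded by pad bp."""
    return any(
        c == chrom and start - pad <= pos <= end + pad
        for c, start, end in targets
    )


def passes_fusion_filter(left, right, targets):
    sides = (left, right)
    if any(s["percent_identity"] < 0.9 for s in sides):   # possibly mis-mapped
        return False
    if any(s["depth"] < 4 for s in sides):                # too little support
        return False
    # At least one side must fall in a target region +-50bp.
    return any(in_target(s["chrom"], s["pos"], targets) for s in sides)


if __name__ == "__main__":
    targets = [("2", 1000, 2000)]
    left = {"chrom": "2", "pos": 2040, "percent_identity": 0.95, "depth": 10}
    right = {"chrom": "5", "pos": 500, "percent_identity": 0.99, "depth": 8}
    print(passes_fusion_filter(left, right, targets))
```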

dependencies

  • bwa
  • samtools
  • GATK
  • CREST
  • Rscript
  • java
  • blat server (/gfServer start 192.168.1.205 8000 /home/liubei/bin/x86_64/CREST/human_g1k_v37.fasta.2bit)
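
Before a run it can help to check that the command-line tools are reachable. The sketch below only looks up executables on `PATH` with the standard library; `missing_tools` is a hypothetical helper, and GATK and CREST are left out because how they are launched depends on the installation.

```python
import shutil


def missing_tools(tools=("bwa", "samtools", "java", "Rscript")):
    """Return the names of the tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing tools: " + ", ".join(missing))
    else:
        print("all listed tools found")
```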

needed developments

  1. some data have problems from the experimental stage, and crest may not finish on them or may need a very long time
  2. cnv could be split into two types: plasma and others
  3. SSR detection?

perl package dependencies

  • Bio::DB::Sam
  • Data::Dumper
  • Excel::Writer::XLSX
  • File::Basename
  • File::Spec
  • FindBin
  • Getopt::Long
  • IO::All
  • List::Util
  • Spreadsheet::XLSX
  • Statistics::Basic
  • Text::Iconv

R library dependencies

  • cghFLasso
