
freec2gistic

Workflow: Convert CONTROL-freec output to GISTIC2

Aim

The workflow presented here converts the output generated by CONTROL-freec into the input format required by GISTIC2, in order to identify high-confidence somatic copy-number alterations.

External guidelines

The following resource was useful for getting the input format for GISTIC2 right:

https://www.biostars.org/p/426635/

Input

From CONTROL-freec, the following two files are needed: PREFIX_ratio.txt and PREFIX_sample.cpn

Additionally, CONTROL-freec provides a script to convert output files to BED format (freec2bed.pl; https://github.com/BoevaLab/FREEC/tree/master/scripts/freec2bed.pl)

The ratio file contains information about ratios and predicted copy number alterations for each window. The cpn file contains information about the number of reads per called window.

Furthermore, a file with chromosome name and length is needed (e.g. ref_hg19.genome).
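If no such file is at hand, it can usually be derived from the FASTA index of the reference genome. Below is a sketch assuming a samtools-style .fai index; the printf line creates a two-chromosome mock index so the example is self-contained, in practice the file would come from samtools faidx.

```shell
# mock .fai-style index (name, length, offset, linebases, linewidth);
# in practice this comes from: samtools faidx hg19.fa
printf 'chr1\t249250621\t52\t60\t61\nchr2\t243199373\t253404903\t60\t61\n' > hg19.fa.fai

# the .genome file needs only chromosome name and length
cut -f1,2 hg19.fa.fai > ref_hg19.genome
cat ref_hg19.genome
# chr1	249250621
# chr2	243199373
```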

Convert the ratio file to BED

Run the freec2bed.pl script on the ratio file, e.g. freec2bed.pl -f Lx002031.bam_ratio.txt > Lx002031.freec_segments.bed

If a whole cohort is analyzed, the following loop processes a directory (assuming all files are stored in the same directory):

INPUTRATIO=/some/path/to/freec/output/files

for file in "$INPUTRATIO"/*bam_ratio.txt; do
	echo "$file"
	OUTFILE=$(echo "$file" | sed -e "s/bam_ratio.txt/freec_segments.bed/")
	freec2bed.pl -f "$file" > "$OUTFILE"
done

Join BED files in R

Files are joined and the ratio is transformed by applying log2(ratio) - 1 (following https://www.biostars.org/p/426635/):

library(tidyverse)
library(GenomicRanges)

# list of files with pattern matching
files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="freec_segments.bed", full.names=TRUE)

segData <- NULL
for( i in files ) {
    print(i)
    tab <- read.table(i)
    # derive the sample ID from the file name
    id <- sub(".freec_segments.bed", "", basename(i), fixed = TRUE)
    tab$Sample <- id
    tab$Dummy <- 'NA'  # placeholder column, filled with Num_Probes later
    tab <- tab[, c("Sample", "V1", "V2", "V3", "Dummy", "V4")]
    # convert ratio to log2(ratio)-1
    tab$V4 <- log2(tab$V4) - 1
    segData <- rbind(segData, tab)
}
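The log2(ratio) - 1 transform can be sanity-checked by hand. Assuming the fourth column of the freec2bed output is the ratio scaled by the ploidy (so a copy-neutral diploid window sits near 2, as discussed in the Biostars thread above), a neutral window maps to 0, a one-copy gain (3) to roughly 0.58, and a one-copy loss (1) to -1:

```shell
# log2(x) - 1 for a few example values; awk has no log2, so use log(x)/log(2)
printf '2\n3\n1\n' | awk '{ printf "%.2f\n", log($1) / log(2) - 1 }'
# 0.00
# 0.58
# -1.00
```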

Next, chromosomes X and Y are removed and columns are renamed.

segData <- segData %>% 
    dplyr::filter( V1 != 'chrX' & V1 != 'chrY') %>% 
    dplyr::rename( Chromosome = V1, Start = V2, End = V3, log2ratio = V4 )

segData$Chromosome <- gsub(pattern = "chr", replacement = "", segData$Chromosome)

Add coverage information and merge ratio and coverage

Retrieve coverage information:

files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="bam_sample.cpn", full.names=TRUE)
cpnData <- NULL
for( i in files ) {
    print(i)
    tab <- read.table(i)
    id <- sub(".bam_sample.cpn", "", basename(i), fixed = TRUE)
    tab$Sample <- id
    tab <- tab[, c("Sample", "V1", "V2", "V3", "V4", "V5")]
    cpnData <- rbind(cpnData, tab)
}

colnames(cpnData) <- c("Sample", "Chromosome", "Start", "End", "Number", "Range")

Combine coverage and ratios

segData_num <- NULL
for ( i in unique( cpnData$Sample ) ) {
    print(i)
    cpn.s <- cpnData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% GRanges
    seg.s <- segData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% dplyr::rename(Num_Probes=Dummy) %>% GRanges
    
    # for each segment, sum the read counts (Number) of all overlapping cpn windows
    for ( j in 1:length(seg.s) ) {
        hits <- data.frame( findOverlaps( seg.s[j, ], cpn.s ) )$subjectHits
        mcols( seg.s[j, ] )$Num_Probes <- sum( data.frame( cpn.s[hits, ] )$Number )
    }
    
    seg.s <- data.frame(seg.s)
    seg.s$Sample <- i
    segData_num <- rbind(segData_num, seg.s)
    
}

segData_num <- segData_num %>% 
    dplyr::select( Sample, seqnames, start, end, Num_Probes, log2ratio)

Finally, save all files

write.table(x = segData, file = "gistic.segments.txt", sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = segData_num, file = "gistic.segments.numProbes.txt", sep = "\t", row.names = F, col.names = F, quote = F)

If necessary, common germline CNVs can be removed as described below. Here, a list of common CNVs for hg19 was retrieved from the Broad Institute's GISTIC2 support files (ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/):

common.cnvs <- GRanges( read.delim( file = "../data/CNV.hg19.bypos.111213.txt" ) )
# keep only segments without any overlap with the common CNV list
idx <- which( countOverlaps( GRanges(segData[, -1]), common.cnvs ) == 0 )
segData <- segData[idx, ]
rownames(segData) <- NULL

Prepare input for GISTIC2

The sed 's/^/chr/' call in the loop below re-adds the chr prefix that was stripped in the R code above; adjusting the R code would make it unnecessary, but leaving it in does no harm.

INPUT_FREEC=gistic.segments.numProbes.txt
REF_IDX=ref_hg19.genome
WORK_DIR=gistic_input

mkdir -p $WORK_DIR

for sample in $(cut -f1 $INPUT_FREEC | sort -u); do
    echo $sample
    # grep -P makes \t match a real tab; the plain "\t" in a BRE does not
    grep -P "^$sample\t" $INPUT_FREEC | cut -f2-6 | sed 's/^/chr/' > $WORK_DIR/$sample.bed
    bedtools sort -i $WORK_DIR/$sample.bed -faidx $REF_IDX > $WORK_DIR/$sample.sorted.bed
    bedtools complement -i $WORK_DIR/$sample.sorted.bed -g $REF_IDX > $WORK_DIR/$sample.complement.bed
    # pad the uncovered regions as copy-neutral rows (NA probes, log2 ratio 0);
    # concatenating $0 directly avoids the stray space a comma would insert
    awk '{print $0 "\tNA\t0"}' $WORK_DIR/$sample.complement.bed > $WORK_DIR/$sample.CN2.bed
    cat $WORK_DIR/$sample.sorted.bed $WORK_DIR/$sample.CN2.bed > $WORK_DIR/$sample.whole.bed
    bedtools sort -i $WORK_DIR/$sample.whole.bed -faidx $REF_IDX > $WORK_DIR/$sample.whole.sorted.bed
    awk -v sample=$sample '{print sample "\t" $0}' $WORK_DIR/$sample.whole.sorted.bed > $WORK_DIR/$sample.whole.sorted.seg
done
done
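To illustrate what the complement/padding steps achieve, here is a toy stand-in for bedtools complement on a single 1000 bp "chromosome" (pure awk, for illustration only; use bedtools on real data):

```shell
# toy: two segments covering part of a 1000 bp chromosome
printf 'chr1\t0\t100\tNA\t0.5\nchr1\t300\t1000\tNA\t-0.3\n' > toy.sorted.bed

# emit the uncovered gaps, padded as copy-neutral rows (NA probes, log2 ratio 0)
awk -F'\t' -v len=1000 '
    BEGIN { prev = 0 }
    $2 > prev { print $1 "\t" prev "\t" $2 "\tNA\t0" }  # gap before this segment
    { prev = $3 }
    END { if (prev < len) print "chr1\t" prev "\t" len "\tNA\t0" }
' toy.sorted.bed
# chr1	100	300	NA	0
```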

cd $WORK_DIR
find . -name "*.whole.sorted.seg" -exec cat {} \; > ../samples.seg
cd ..

The file samples.seg is the final input for GISTIC2.
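Before handing samples.seg to GISTIC2, a quick check that every row carries the six expected columns (sample, chromosome, start, end, number of probes, log2 ratio) can save a debugging round. The printf line below creates a tiny mock file so the example is self-contained; point the awk call at your real samples.seg instead:

```shell
# tiny mock segment file; replace with the real samples.seg in practice
printf 'S1\tchr1\t1\t1000\t42\t0.00\nS1\tchr2\t1\t2000\tNA\t0\n' > samples.seg

# exit non-zero (and report the row) if any line deviates from 6 tab-separated fields
awk -F'\t' 'NF != 6 { print "bad row " NR; bad = 1 } END { exit bad }' samples.seg \
  && echo "samples.seg looks OK"
# samples.seg looks OK
```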

Running GISTIC2

GISTIC2 is implemented as a set of MATLAB scripts. A Docker image and installation instructions are provided here:

https://hub.docker.com/r/shixiangwang/gistic

Basically, just run docker pull shixiangwang/gistic to install the image.

To run the image use

sudo docker run -it shixiangwang/gistic /bin/bash

Inside the container, change the working directory and create the input and output directories:

cd /opt/GISTIC/
mkdir indat
mkdir gistic_out

We have to transfer our samples.seg file into the running container. To do so, retrieve the container ID and copy the file (here, the ID is 16d57225d045; it will differ in your environment):

docker container ls
docker cp samples.seg 16d57225d045:/opt/GISTIC/indat/

Now, we can run GISTIC2

./gistic2 -b gistic_out -seg indat/samples.seg -refgene refgenefiles/hg19.mat

and copy the result files back to the host (from outside the container):

docker cp 16d57225d045:/opt/GISTIC/gistic_out ./

DONE

Contributors

kunstner
