Code Monkey home page Code Monkey logo

cancer_epitopes_cshl's Introduction

Cancer Epitopes CSHL

Plan

Goal: given an SRA ID, prioritize and quantify variants with respect to immunogenicity (single score) + variant annotation

  • use pvacseq to generate the sequences around variants (wt/mutant)
  • use FRED2 to make binding predictions
  • given the different prediction score, make one immunogenicity score

  1. Annotate VCF
  2. Extract peptide sequences
  3. Define sample's MHC alleles
  4. Collect MHC-peptide affinity predictions
  5. Add our own classifier score

workflow

More info:


1) Annotate RNAseq VCF

nohup  variant_effect_predictor.pl \
   --input_file /home/data/vcf/hisat_tags_output_SRR1616919.sorted.vcf  \
    --format vcf \
     --terms SO --offline  --force_overwrite \
      --plugin Wildtype --plugin Downstream  \
       --dir /home/data/vep   --vcf --symbol \
        --fork 16  --coding_only --no_intergenic \
         --output_file /home/data/imm/hisat_tags_output_SRR1616919.sorted.annotated.vcf  & 

2a) Generate FASTA with pVACSeq and write to csv

The genome information stored in the vcf file need to be translated into corresponding peptide sequences. For this, we use parts of pVacSeq, which generates a csv file with 21-mers surrounding the mutation (both, WT and mutant form)

    cd $HOME  
    git clone https://github.com/NCBI-Hackathons/Cancer_Epitopes_CSHL.git
 
    source activate python3 
    
    cd $HOME/Cancer_Epitopes_CSHL/src   
    python -c 'import generate_fasta; generate_fasta.generate_fasta_dataframe("/home/devsci7/test.output.2", "/home/devsci7/step2.csv", 21,9)'  

2b) Define MHC locus for HLA genotyping

Tools for HLA genotyping typically re-align the raw reads in order to identify the HLA type from RNA-seq. To obtain the reads roughly aligned to these genes we need to define the region and specify it during the alignment process. The MHC complex consists of more than 200 genes located close together on chromosome 6. The script hla_type.sh extracts reads overlapping with the MHC locus and turns them into two fastq files. These reads are then re-aligned and analyzed by OptiType

  src/hla_type.sh -b /home/data/hisat_tags_output_SRR1616919.sorted.bam \
    -r NC_000006.12:29600000-33500000 -o test --path /opt/samtools/1.3.1/bin/

3) Compute immunogenicity for each peptide

Run the script: src/imm_predict/fred2_allele_prediction.py to compute the MHC binding affinity predictions for each reference and alternative allele.

Example

python2 ./src/imm_predict/fred2_allele_prediction.py \
        ./pvacseq_table.csv ./variant_immunogenicity.csv

Usage

       fred2_allele_prediction.py [--alleles=<alleles_list>] FILE_IN FILE_OUT                          
       fred2_allele_prediction.py -h | --help                                                          
                                                         
Argument Explanation
FILE IN (req.) Input csv file with... ???
FILE OUT (req.) Name for output csv file
alleles Comma separated list of target alleles, e.g., --alleles="B*27:20,B*83:01,A*32:15" [Default: use all]

4. Create and explore the training data set

  • GOAL: classifier learnt on peptides with reported immunogenic cancer neoantigens (WT sequences without immunogenicity)

Collection of training sets

To play around with the data sets, here are the current workflows:

# read in the data (this expects that you're in Cancer_Epitopes_CSHL;
# just check the paths that are hard-coded at the moment within the
# script
Rscript --vanilla --slave src/imm_explore/0-read.R
# this should have created data/immunogenic_SNVs-training_sets.csv

export TMPDIR=/tmp/
PYTHON=/opt/modules/i12g/anaconda/3-4.1.1/envs/python27/bin/python 

# calculate the binding affinities
PYTHON src/imm_explore/fred2_design_matrix.py \
	--input=data/immunogenic_SNVs-training_sets.csv \
	--output=data/immunogenic_SNVs-model_data.csv

Compute the background protein immunogenicity [optional]

This needs some discussion and perhaps work?

Given a FASTA file with all human proteins, compute the immunogenicity for all posible 9-mer peptides.

NOTE: This will take a very long time to compute!

Script: src/imm_predict/fred2_background.py.

Usage:
       fred2_background.py [--alleles=<alleles_list> --top_N=N] FILE_IN FILE_OUT                       
       fred2_background.py -h | --help   
Argument Explanation
FILE IN (req.) Input FASTA file with all human proteins, can be retrieved from ENSEMBL
FILE OUT (req.) Name for output csv file
top N Number of top N proteins to compute the background for. [Default: all].
alleles Comma separated list of target alleles, e.g., --alleles="B*27:20,B*83:01,A*32:15" [Default: use all]
Example:
python2 ./src/imm_predict/fred2_background.py \
        ./Homo_sapiens.GRCh38.pep.all.fixheader.fa ./background_peptides.csv

Copy file over

Not sure where this belongs?

cp  /home/devsci7/step2.fasta   /home/data/imm 

cancer_epitopes_cshl's People

Contributors

friedue avatar tweep avatar susannasiebert avatar avsecz avatar dcgenomics avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.