Most software is already installed on our directory. The only things you will need to install are the R packages.
On midway2 do:
module load R/3.5.1
install.packages('tidyverse')
install.packages('bigsnpr')
install.packages("vroom")
devtools::install_github("stephenslab/[email protected]")
BiocManager::install(c("GenomicRanges", "plyranges","rtracklayer"))
Obtain summary statistics and place them here
/project2/xinhe/CANCER_GERMLINE/SUMMARY_STATISTICS/
You will need summary statistics with the following 8 columns:
- chromosome (just the number, no "chr", hg19!)
- position (base pair position, hg19!)
- beta (if you have Odds Ratio, you will need to transform it to log(Odds Ratio))
- standard error (SE)
- reference allele (A,C,T,G)
- association/effect allele (A,C,T,G)
- some kind of SNP ID (rsID, or chr:position:a1:a2)
- p-value
Write down the column names somewhere for you will need them next.
Create a working directory and clone this repo into it
mkdir my_project_directory
cd my_project_directory
git clone [email protected]:aselewa/finemap-cancer.git .
Run the preprocessing step with THREE arguments:
./preprocess.sh SUMSTATS COLUMNS prefix
- SUMSTATS : full path to your summary statistics
- COLUMNS: a comma(,) delimited string with the columns we described above! Make sure they are in the order described above.
- prefix: a meaningful name for the cancer/data type. i.e. BrCa for breast cancer
Example:
./preprocess.sh /project2/xinhe/alan/Cancer/GERMLINE/SUMMARY_STATISTICS/colorectal_raw_sumstats.txt.gz \
chr,position_b37,ocac_overall_OR,ocac_overall_se,a0,a1,1KG_Phase3_ID,ocac_overall_pvalue \
"ColCa"
This can take about 10 minutes depending on the size of your GWAS. If this worked, you should see a folder called cleaned_sumstats
and inside there is {prefix}_cleaned_sumstats.txt.gz
inside. The first few lines of the new file should look like:
zcat ColCa_cleaned_sumstats.txt.gz | head -n 5
chr pos a0 a1 beta se snp pval zscore og_index bigSNP_index locus
1 13110 A G -0.02066 0.03231 1 0.52245 -0.6394305168678428 3 4 1
1 13116 G T -0.01255 0.01895 rs201725126 0.50772 -0.662269129287599 4 5 1
1 13118 G A -0.01255 0.01895 rs200579949 0.50772 -0.662269129287599 5 6 1
1 13550 A G 0.13184 0.11236 1 0.24066 1.173371306514774 7 11 1
IMPORTANT: you can have multiple summary statistics in the cleaned_sumstats/
folder. Just run the preprocess.sh
recipe for each summary statistics trait. MAKE SURE THEY HAVE DIFFERENT PREFIX!
Annotations should be a .bed
file with only the first 3 columns: chromosome, start, end. Chromosome should just be the number, no "chr" and should be in hg19/b37 coordinates. In your project directory, you should see a folder called annotations/
Go into this. You should now see a folder called {prefix}/
where prefix
is what you choose in step 2. Place your annotation file (.bed
extension in here).
It may be useful to look at existing annotations that I placed at /project2/xinhe/CANCER_GERMLINE/ANNOTATIONS/
Your annotation should look like this:
head mybed.bed
1 564999 566395
1 568591 569959
1 569105 570623
1 713254 715133
1 762120 763780
IMPORTANT: you can place as many bed-files/annotations in the prefix
directory as you want! Just make sure they have different names. These are the annotations that will be used for this specific trait.
We are now ready to run the main pipeline for finemapping.
With your project directory (my_project_directory
) from earlier, you should now see a file called torus_susie_snakemake.py
. Open this file using Vim or nano and edit the variable called pd
, by placing the full path to your working directory.
pd = 'YOUR PROJCT DIRECTORY'
Example:
pd = '/project2/xinhe/alan/Cancer/mydir'
To run the pipeline, execute the following:
sbatch submit_snake.sbatch
Make sure the job is running:
qstat -u CNETID
Once it is over, you should see a directory called susie_finemapping
which contains the finemapped results for each chromosome.