Code Monkey home page Code Monkey logo

siglasso's Introduction

Image of icon

Optimizing Cancer Mutation Signatures Jointly with Sampling Likelihood

Image of siglasso

Why you should use sigLASSO?

First of all, it is because mutation counts matter. Most other methods, including deconstructSigs, are invariant to different mutation counts. Let's say we observed 10,000 mutations in a cancer sample, this number is large therefore we should have a pretty good estimation of the mutational spectrum of the tumor. Here is a synthetic example.

Context Counts Percentage
AC>AA 511 5.11%
AC>AC 305 3.05%
AC>AG 96 0.96%
AC>AT 913 9.13%
... ... ...
TT>GT 0 0

But life is not always so good. This time we sequenced another tumor but observed only 100 mutations. It could be because of exome sequencing, shallow sequencing depth or "silent" tumor. Now we are highly unsure about the spectrum.

Context Counts Percentage
AC>AA 5 5%
AC>AC 3 3%
AC>AG 1 1%
AC>AT 9 9%
... ... ...
TT>GT 0 0

Although the percentage is almost exactly the same as the previous sample, we are much less certain about our spectrum estimation here. For example, in the first sample, with 0/10,000 TT>GT, we are pretty certain that this mutation context is very rare. However, in the second sample, the last zero is very likely due to undersampling rather than a real, very low mutation frequency. Therefore, we should expect these two samples to have two different signature solutions. A more confident one for the first one, and a coarse one for the second sample to reflect the uncertainty in sampling. Most current signature tools will give identical solutions when the percentage matches.

SigLASSO considers both sampling error (especially important when the mutation count is low) and signature fitting.

Moreover, it parameterizes the model empirically. Let the data speak for itself. Moreover, you will be able to feed prior knowledge of the signatures into the model in a soft thresholding way. No more picking up signature subsets by hand! SigLASSO achieves signature selection by using L1 regularization.

Dependencies

To fetch the package from GitHub, you will need "devtools"

install.packages("devtools")
library("devtools")

If start with vcf files, you will additionally need bedtools and a reference genome FASTA file. See the below.

Install

Just one line and voilà!

devtools::install_github("gersteinlab/siglasso")

Usage

It is as simple as 2 (or 3) steps!

0. Starting with VCF

If started with VCF, that is if you do not have the flanking nucleotide context. To make the package transparent and lightweight, we decide to wrap around a bash script to parse mutational context. To do this, you will need bedtools(https://bedtools.readthedocs.io). For Mac OS, I find package managers, like homebrew(https://brew.sh/), greatly simplify the installation.

brew install bedtools

Then you also need to have an uncompressed reference genome that is compatible with your vcf ready.

Before running, one last thing is to prepare a meta file specify the paths of all your vcf files. An optional 2nd column can be used to specify sample names. Otherwise, it will grab the filenames as sample names.

/Users/data/breast/brca_cancer_1.vcf, brca_a_1
/Users/data/breast/brca_cancer_1.vcf, brca_b_1
/Users/data/prostate/prca_cancer_1.vcf, prca_c_1

Run

my_spectrum <- vcf2spec(bedtools_path, vcf_meta, ref_genome, output_file, ...)

This function will write "output_file" to disk. It is a context file.

1. Starting with a context file

Alternatively, you can start with a context file. It is easier when you already know the flanking region context. Many annotation tools now give you this.

The context file has three columns. The 1st column is the reference nucleotide context. Typically, it is a trinucleotide. The 2nd column stores the alternative alleles. The last column has the sample names (can be unordered).

ATT    C    brca_a_1
CGT    C    brca_a_2
TCC    A    brca_a_1

Now convert this raw mutation context file into a spectrum.

my_context_file <- read.table(path_to_context_file)
my_spectrum <- context2spec(my_context_file)

The package comes with an example of 7 AML mutations downloaded from ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl (Alexandrovet al., 2013). It contains 3,414 SNVs, already in the context file format.

data(aml_7_wgs)
my_spectrum <- context2spec(aml_7_wgs)

It will make plots of the spectrum, to know more, see plot_spectrum() Image of plot_spectrum

Note: the specturm could be further "normalized". But the COSMIC signatures are built on a mixture of WGS and WES samples, it is unclear such normarlization will provide any advantage. Meanwhile, normalization here destroyed the discrete nature of the mutation counts. So please do not use normarlized spectrum as input for siglasso (step 2). siglasso() provides a normalization option applied to the signatures (which is not recommended either for COSMIC signatures).

2. Apply siglasso to spectrum

This step is straightforward. You can supply your own signature file, or it will use the COSMIC signature.

my_sigs <- siglasso(my_spectrum, ...)

A useful thing is prior. Pass a vector of the length of the number of signatures, with numbers between 0 (strong preference, we recommend using a small number, e.g. 0.01, rather than 0) and 1 (no preference). We supply a prior file that we curated from COSMIC. The prior file consists of binary indicators of a signature is observed in COSMIC studies (value: 0) or not (value: 1). To apply an easy flat prior on previously observed signatures, just multiple the prior vector with a value less than 1. Here is an example.

data(cosmic_priors)
colnames(cosmic_priors)
my_prior = ifelse(cosmic_priors$BRCA==0, 0.1, 1) #adjust the strength to 0.1, BRCA
my_sigs <- siglasso(my_spectrum, prior = my_prior[1:30]...) #the last row is "other signatures"

3. Visualization

By default, siglasso() automatically generates a barplot of every sample

plot_sigs(my_sigs, ...)

There is another option to generate a dotchart to compare between samples or samples groups. For illustration, we randomly group the first four samples and the last three into group 1 & 2.

plot_sigs_grouped(my_sigs, c(rep(1,4), rep(2,3))...)

Image of plots

4. Documentation

You can use the build-in documentation in R, or download a pdf version of the pacakge manual at (http://labmisc.gersteinlab.org/siglasso_1.1.pdf)

Longer Introduction

Multiple mutational processes drive carcinogenesis, leaving characteristic signatures on tumor genomes. Determining the active signatures from the full repertoire of potential ones can help elucidate mechanisms underlying cancer initiation and development. This task involves decomposing the counts of cancer mutations, tabulated according to their trinucleotide context, into a linear combination of known mutational signatures. We formulate it as an optimization problem and develop sigLASSO, a software tool, to carry it out efficiently. SigLASSO features four key aspects: (1) It jointly optimizes the likelihood of sampling and signature fitting, by explicitly adding multinomial sampling into the overall objective function, This is particularly important when mutation counts are low and sampling variance is high, such as in exome sequencing. (2) sigLASSO uses L1 regularization to parsimoniously assign signatures to mutation profiles, leading to sparse and more biologically interpretable solutions resembling previously well-characterized results. (3) sigLASSO fine-tunes model complexity, informed by the scale of the data and biological-knowledge based priors. In particular, instead of hard thresholding and choosing a priori a discrete subset of active signatures, sigLASSO allows continuous priors, which can be effectively learned from auxiliary information. (4) Because of this, sigLASSO can assess model uncertainty and abstain from making certain assignments in low-confidence contexts. Finally, to evaluate SigLASSO signature assignments in comparison to other approaches, we develop a set of reasonable expectations (e.g. sparsity, the ability to abstain, and robustness to noise) that we apply consistently in a variety of contexts.

Citation

This work is accepted by Nature Communications, please cite https://doi.org/10.1038/s41467-020-17388-x

Data

Download sigLASSO output on TCGA samples with COSMIC (v2) priors (http://labmisc.gersteinlab.org/all_sigs.tsv.tar.gz)

What is new

version 1.1 (2020 April)

  • Support for the newest COSMIC signature v3 (both whole genome and exome)
  • Also better visualization support

siglasso's People

Contributors

shantaol avatar

Stargazers

Paul L. Maurizio avatar Axel Rosendahl Huber avatar M. Nazif Taşbaş avatar  avatar Taro Matsutani avatar Anthony Ferrari avatar Markus Riester avatar Sasha Gusev avatar  avatar  avatar

Watchers

James Cloos avatar minxu33 avatar  avatar Mengting Gu avatar Jeremy Kavalan avatar Hugo Lam avatar  avatar David Z. Chen avatar  avatar  avatar Donghoon Lee avatar Koon-Kiu Yan avatar  avatar Rob Kitchen avatar  avatar Joel Rozowsky avatar  avatar  avatar Mihali Felipe avatar  avatar  avatar  avatar  avatar  avatar  avatar

siglasso's Issues

error in cv.glmnet()

Hi,

I got a error when I ran
cv_output <- cv.glmnet(x_vars[train,], y_var[train],
alpha = 1, lambda = lambda_seq,nfolds = 5)

Error in elnet(xd, is.sparse, ix, jx, y, weights, offset, type.gaussian, :
y is constant; gaussian glmnet fails at standardization step.

Can someone help me out?

Thanks,

maomao

Extremely slow with many samples

Hello,

I'm evaluating siglasso for our analysis pipeline, and I have followed the instructions to generate the mutational spectrum, and then ran the analysis.

However, unless there is some caveat I've missed when reading the documentation, the calculations are very slow. I ran siglasso on 83 samples from a 16Mbp capture panel (and with a low number of variants per sample) and it has now been running for more than 8 hours with no sign of completion in sight.

Is there anything one can do to either parallelize or speed up the computation?

Fail due to glmnet error

Hello, when I used siglasso on my data, the glmnet package threw an error:

Error in elnet(x, is.sparse, ix, jx, y, weights, offset, type.gaussian, y is constant; gaussian glmnet fails at standardization step

What do you think might be the problem? What can I do to solve it? Thank you!

p.s., I'm using a combination of our exome sequencing data and several TCGA cohorts

Error: all penalty factors are <= 0

I'm running siglasso with some samples that may be problematic, and I get errors like

Error: from glmnet Fortran code (error code 10000); All penalty factors are <= 0

I'm wondering if siglasso should just skip these (and fill the relevant sample data with NA). At the same time, is there any way to find why these samples are problematic? A few of these had too few variants (and were promptly removed), but others have reasonable numbers (50+).

Error in context2spec(context_file) : Unexpected mutation context!

Hi,

I have some vcf 4.1 files which call from VarScan2
Then I make a vcf metafile as below

/WholeExomeSequencing/VCFformat4.1_VarScan/recal-T115_S126.vcf    recal-T115_S126.vcf

and run in R

my_spectrum <- vcf2spec(bedtools_path, vcf_meta, ref_genome, output_file)

then output is

awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
[1] "TC>CC"
Error in context2spec(context_file) : Unexpected mutation context!

I have checked the output_file (context_file), get_context.sh is work

head -n 5 context_file
TCT     A       recal-T115_S126.vcf
TCT     A       recal-T115_S126.vcf
CTT     C       recal-T115_S126.vcf
CAA     G       recal-T115_S126.vcf
ATA     G       recal-T115_S126.vcf

Is the TC>CC a wrong mutation format ?

Thank,
Gray

Mistaken Chromosome Name Mismatch Error

I am running vcf2spec but I see lots of errors like "WARNING. chromosome (1) was not found in the FASTA file. Skipping." The FASTA and VCF use the same style of chromosome names.

$ grep \> hs38DH.fasta | head 
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
>chr4  AC:CM000666.2  gi:568336020  LN:190214555  rl:Chromosome  M5:3210fecf1eb92d5489da4346b3fddc6e  AS:GRCh38
>chr5  AC:CM000667.2  gi:568336019  LN:181538259  rl:Chromosome  M5:a811b3dc9fe66af729dc0dddf7fa4f13  AS:GRCh38  hm:47309185-49591369
>chr6  AC:CM000668.2  gi:568336018  LN:170805979  rl:Chromosome  M5:5691468a67c7e7a7b5f2a3a683792c29  AS:GRCh38
>chr7  AC:CM000669.2  gi:568336017  LN:159345973  rl:Chromosome  M5:cc044cc2256a1141212660fb07b6171e  AS:GRCh38
>chr8  AC:CM000670.2  gi:568336016  LN:145138636  rl:Chromosome  M5:c67955b5f7815a9a1edfaa15893d3616  AS:GRCh38
>chr9  AC:CM000671.2  gi:568336015  LN:138394717  rl:Chromosome  M5:6c198acf68b5af7b9d676dfdd531b5de  AS:GRCh38
>chr10  AC:CM000672.2  gi:568336014  LN:133797422  rl:Chromosome  M5:c0eeee7acfdaf31b770a509bdaa6e51a  AS:GRCh38

The VCF file has the same style of chromosome IDs.

$ grep ^chr OSCC_1-P_pass.vcf | head
chr1    455086  .       C       A       .       PASS    CONTQ=93;DP=110;ECNT=1;GERMQ=93;MBQ=29,33;MFRL=347,303;MMQ=25,30;MPOS=56;NALOD=1.36;NLOD=6.57;POPAF=0.924;ROQ=93;SEQQ=93;STRANDQ=35;TLOD=17.87  GT:AD:AF:DP:F1R2:F2R1:SB0/0:22,0:0.042:22:10,0:11,0:13,9,0,0    0/1:76,11:0.134:87:31,6:41,5:45,31,3,8
chr1    613111  .       C       T       .       PASS    CONTQ=22;DP=65;ECNT=1;GERMQ=49;MBQ=33,33;MFRL=394,309;MMQ=41,32;MPOS=3;NALOD=1.4;NLOD=6.92;POPAF=0.117;ROQ=27;SEQQ=13;STRANDQ=36;TLOD=6.48      GT:AD:AF:DP:F1R2:F2R1:SB0/0:23,0:0.039:23:10,0:13,0:11,12,0,0   0/1:30,4:0.137:34:13,4:17,0:15,15,2,2
chr1    653704  .       G       A       .       PASS    CONTQ=4;DP=78;ECNT=1;GERMQ=93;MBQ=33,34;MFRL=385,258;MMQ=60,41;MPOS=39;NALOD=1;NLOD=5.09;POPAF=0.876;ROQ=34;SEQQ=5;STRANDQ=15;TLOD=5.54 GT:AD:AF:DP:F1R2:F2R1:SB       0/0:18,0:0.051:18:8,0:9,0:12,6,0,0       0/1:47,3:0.08:50:20,2:22,1:30,17,1,2
chr1    1140424 .       A       T       .       PASS    CONTQ=41;DP=37;ECNT=1;GERMQ=18;MBQ=33,32;MFRL=385,163;MMQ=47,47;MPOS=13;NALOD=0.996;NLOD=2.64;POPAF=3.49;ROQ=56;SEQQ=10;STRANDQ=9;TLOD=6.24     GT:AD:AF:DP:F1R2:F2R1:SB0/0:9,0:0.092:9:6,0:2,0:8,1,0,0 0/1:22,3:0.147:25:14,2:7,1:21,1,3,0
chr1    1253674 .       C       T       .       PASS    CONTQ=93;DP=164;ECNT=1;GERMQ=93;MBQ=32,33;MFRL=382,397;MMQ=60,60;MPOS=43;NALOD=1.7;NLOD=14.39;POPAF=2.4;ROQ=66;SEQQ=93;STRANDQ=72;TLOD=26.99    GT:AD:AF:DP:F1R2:F2R1:SB0/0:48,0:0.02:48:26,0:21,0:27,21,0,0    0/1:99,12:0.115:111:40,7:58,5:59,40,6,6
chr1    1678117 .       G       A       .       PASS    CONTQ=93;DP=170;ECNT=1;GERMQ=93;MBQ=30,33;MFRL=386,429;MMQ=60,60;MPOS=38;NALOD=1.72;NLOD=15.29;POPAF=6;ROQ=78;SEQQ=93;STRANDQ=66;TLOD=24.72     GT:AD:AF:DP:F1R2:F2R1:SB0/0:51,0:0.019:51:31,0:19,0:18,33,0,0   0/1:102,11:0.104:113:54,6:47,5:57,45,6,5
chr1    1952695 .       G       A       .       PASS    CONTQ=93;DP=147;ECNT=2;GERMQ=93;MBQ=25,33;MFRL=376,367;MMQ=60,60;MPOS=39;NALOD=1.59;NLOD=11.29;POPAF=4.61;ROQ=69;SEQQ=93;STRANDQ=80;TLOD=43.4   GT:AD:AF:DP:F1R2:F2R1:SB0/0:38,0:0.025:38:19,0:16,0:21,17,0,0   0/1:87,18:0.178:105:41,12:41,6:60,27,13,5
chr1    2032531 .       C       T       .       PASS    CONTQ=93;DP=175;ECNT=1;GERMQ=93;MBQ=26,33;MFRL=391,407;MMQ=60,60;MPOS=27;NALOD=1.72;NLOD=15.63;POPAF=6;ROQ=66;SEQQ=81;STRANDQ=46;TLOD=13.34     GT:AD:AF:DP:F1R2:F2R1:SB0/0:52,0:0.019:52:25,0:27,0:27,25,0,0   0/1:112,8:0.073:120:57,4:49,4:59,53,4,4
chr1    2964332 .       G       A       .       PASS    CONTQ=93;DP=145;ECNT=1;GERMQ=93;MBQ=32,34;MFRL=380,355;MMQ=60,60;MPOS=54;NALOD=1.6;NLOD=11.73;POPAF=6;ROQ=57;SEQQ=93;STRANDQ=67;TLOD=26.28      GT:AD:AF:DP:F1R2:F2R1:SB0/0:39,0:0.024:39:20,0:19,0:21,18,0,0   0/1:91,11:0.115:102:48,7:43,4:46,45,5,6
chr1    3081664 .       A       G       .       PASS    CONTQ=93;DP=155;ECNT=1;GERMQ=93;MBQ=33,32;MFRL=396,389;MMQ=60,60;MPOS=38;NALOD=1.69;NLOD=14.44;POPAF=6;ROQ=93;SEQQ=93;STRANDQ=93;TLOD=128.56    GT:AD:AF:DP:F1R2:F2R1:SB0/0:48,0:0.02:48:25,0:23,0:27,21,0,0    0/1:61,44:0.42:105:32,19:29,25:30,31,24,20

It looks like sigLASSO removes the chr prefix from the chromosome ID without asking the user, which is bad form. Surely, the user should be allowed to use chromosome names such as chr1, chr2.

non-human genome

Hi,
I was wondering if the program can be used with counts obtained from a non-human genome. I have generated the required input and ran SigLasso successfully, but I wonder if it can be safely used with whole genome data from a non-human organism or are there assumptions that could be broken.
Thanks!

`siglasso` function takes too long running time

i got mutations from almost 1000 samples, then try siglasso to get the mutational signature as suggested. However, it took more than 12 hours, and it is still running until now.
I have no idea whether it is ok in this case.
here is the log the function returns

 my_sigs <- siglasso(tempspec,plot=F)
[1] "No signature supplied, will use COSMIC v2 signatures, we also supply v2 priors in the package"
[1] "There are 942 samples

look forward to your reply.
@ShantaoL

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.