cleaver's People

Contributors

dtenenba, hpages, jwokaty, nturaga, sgibb, sonali-bioc, vobencha

cleaver's Issues

In some cleaved peptides, number of cleavage sites > number of defined missed cleavages

When proteins are digested with a given range of missed cleavages (0:M), each resulting peptide is expected to contain between 0 and M cleavage sites. For some peptides, however, the number of cleavage sites exceeds the missedCleavages value specified in the initial digestion.

In the example below, 78 peptides have more than 2 cleavage sites, even though the trypsin digestion was run with missedCleavages = 0:2.

Test proteins fasta: proteins.fasta.gz

library(cleaver)

## read fasta
proteins <- readAAStringSet("proteins.fasta.gz")

## number of proteins in proteins.fasta
length(proteins)
## [1] 38

## digest proteins with trypsin
cleaved <- cleaver::cleave(proteins, missedCleavages = 0:2, enzym = "trypsin")

## unlist into AAStringSet
peptides <- unlist(cleaved)

## rename individual peptides as: id::peptide
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2], 
                          "::", as.character(peptides))

## get cleaved sites within peptides
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")

## number of peptides with cleavage sites > 2
length(missed[elementNROWS(missed) > 2])
## [1] 78

## peptides with more than 2 cleavage sites
head(missed[elementNROWS(missed) > 2])
## $`A6NL46::RRKK`
## [1] 1 2 3
##
## $`A6NL46::RRKK`
## [1] 1 2 3
##
## $`A6NL46::RRAVSMDNGAKFLR`
## [1]  1  2 11
##
## $`A6NL46::RRPMIYVESSEESSDEQPDEVESPTQSQDSTPAEEREDEGASAAQGQEPEADSQELVQPKTGCELGDGPDTK`
## [1]  1 36 60
##
## $`A6NL46::RRQEGKCK`
## [1] 1 2 6
##
## $`A6NL46::RRGSSIPQFTNSPTMVIMVGLPARGK`
## [1]  1  2 24

And there's also a mismatch between the number of ranges and peptides after enzymatic digestion:

cleaved <- cleaver::cleave(proteins, missedCleavages = 0, enzym = "trypsin")
ranges <- cleaver::cleavageRanges(proteins, missedCleavages = 0, enzym = "trypsin")
sites <- cleaver::cleavageSites(proteins, enzym = "trypsin")

sum(lengths(cleaved))
## [1] 17072
sum(lengths(ranges))
## [1] 23260
sum(lengths(sites))
## [1] 23222 
sum(lengths(sites)) + length(proteins)
## [1] 23260

peptides <- unlist(cleaved)
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2], 
                          "::", as.character(peptides))
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")
length(missed[elementNROWS(missed) > 0])
## [1] 55
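
A side note on the naming step used in both snippets: strsplit(names(cleaved), "\\|")[[1]][2] takes the accession of the first protein only, so every peptide is labelled with it, which would explain why all names in the output above start with A6NL46. A minimal base R sketch of per-protein naming follows. The two headers and the peptide lists are made-up placeholders, and the sp|ACCESSION|NAME header layout is an assumption about the FASTA file.

```r
## Hypothetical stand-ins for names(cleaved) and the cleaved peptides;
## a plain list is used here instead of an AAStringSetList.
headers <- c("sp|A6NL46|PROT1_HUMAN", "sp|P00000|PROT2_HUMAN")
cleavedList <- list(c("AAK", "CCR"), c("DDK"))

## Second "|"-separated field of EVERY header, not only the first one.
ids <- vapply(strsplit(headers, "|", fixed = TRUE), `[`, character(1L), 2L)

## Repeat each accession once per peptide of that protein.
peptideNames <- paste0(rep(ids, lengths(cleavedList)), "::",
                       unlist(cleavedList))
print(peptideNames)
```

This only affects the labels; the counts reported in the issue do not depend on the names.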

swap pepsin 1.3 and pepsin > 2 cleavage rules

Here, the regex labelled pepsin 1.3 (which uses [FLWY]) is less stringent than the one labelled pepsin > 2 (which uses [FL]):

cleaver/R/rules.R

Lines 60 to 63 in c02266a

## Pepsin (pH 1.3)
"pepsin1.3"="((?<=([^HKR][^P])|(^[^P]))[^R](?=[FLWY][^P]))|((?<=([^HKR][^P])|(^[^P]))[FLWY](?=\\w[^P]))",
## Pepsin (pH > 2.0)
"pepsin"="((?<=([^HKR][^P])|(^[^P]))[^R](?=[FL][^P]))|((?<=([^HKR][^P])|(^[^P]))[FL](?=\\w[^P]))",

However, the PeptideCutter documentation says that pepsin is more specific at pH 1.3.

Should the labels for these regular expressions be swapped?
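
To make the difference concrete, here is a small base R sketch that applies both rules, copied from the snippet above, to a made-up sequence containing one W and one F. The sequence and the helper matchedPositions() are illustrative assumptions and not part of cleaver.

```r
## Rules copied verbatim from the quoted lines of cleaver/R/rules.R.
rules <- c(
  "pepsin1.3" = "((?<=([^HKR][^P])|(^[^P]))[^R](?=[FLWY][^P]))|((?<=([^HKR][^P])|(^[^P]))[FLWY](?=\\w[^P]))",
  "pepsin" = "((?<=([^HKR][^P])|(^[^P]))[^R](?=[FL][^P]))|((?<=([^HKR][^P])|(^[^P]))[FL](?=\\w[^P]))"
)

## Hypothetical test sequence (not taken from the issue).
sequence <- "AAAWAAAFAAA"

## Hypothetical helper: positions matched by a rule.
## perl = TRUE is required for the lookbehind/lookahead syntax.
matchedPositions <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1L]]
  as.integer(m[m > 0L])
}

sites <- lapply(rules, matchedPositions, x = sequence)
print(sites)

## Every position matched by the [FL] rule is also matched by the
## [FLWY] rule, so the rule labelled "pepsin1.3" is the more permissive one.
print(all(sites[["pepsin"]] %in% sites[["pepsin1.3"]]))
```

For this sequence the rule labelled pepsin1.3 matches positions 3, 4, 7 and 8, while the rule labelled pepsin matches only 7 and 8.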

trypsin protease and PLGS cleavage rules

It has previously been demonstrated that trypsin digests poorly when certain amino acids are in the vicinity of the K|R cleavage site:

  1. P: trypsin is incapable of cutting if P is in the +1 position, although this is arguable (ref1).
  2. Acidic amino acids in the vicinity of the cleavage site inhibit trypsin activity, but not completely (ref2).
  3. Trypsin cannot cut if the cleavage site is flanked by only 1 amino acid. As a result, a staggered end yields two dead-end products, i.e. XXXXBBYYYY results in both XXXXBB + YYYY and XXXXB + BYYYY (ref3).
  4. Trypsin is reported to have little capacity as a dipeptidyl peptidase, but can function as a peptidyl dipeptidase. That is, it can cut when 2 amino acids are present on the C-terminus, but cannot when 2 amino acids are present on the N-terminus. Hence XBYYYYY is a dead-end product, but XXXXXBY appears to be cut again to yield XXXXX (ref3).

PLGS takes some of these rules into consideration and allows missed-cleavage peptides to be pepFrag1 or pepFrag2 (i.e. suitable for quantitation) if the cleavage site is followed by P, K, R, D or E. Note that the rule only applies to the amino acid directly following K|R, although D and E seem to have an inhibitory effect on trypsin activity at positions -3:+3.

My suggestion is to allow peptides with "special" missed cleavages for quantitation, i.e. when creating a vector of proteotypic peptides, both XXXXK and XXXXKEXXXR should be present.
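
A rough base R sketch of this suggestion is given below. It is not cleaver's trypsin rule and not PLGS's implementation: the helper specialMissedCleavages() is hypothetical, it naively splits after every K or R, and it only looks at the residue directly following the site.

```r
## Hypothetical helper (not part of cleaver): split after every K|R and
## additionally report peptides spanning a "special" site, i.e. a K|R that
## is directly followed by P, K, R, D or E.
specialMissedCleavages <- function(sequence) {
  aa <- strsplit(sequence, "", fixed = TRUE)[[1L]]
  n <- length(aa)
  ## K|R positions, excluding the last residue of the sequence
  sites <- which(aa %in% c("K", "R") & seq_len(n) < n)
  ## a site is "special" if the next residue is P, K, R, D or E
  special <- aa[sites + 1L] %in% c("P", "K", "R", "D", "E")
  start <- c(1L, sites + 1L)
  end <- c(sites, n)
  fullyCleaved <- substring(sequence, start, end)
  ## join the peptide ending at a special site with its successor
  i <- which(special)
  specialMissed <- if (length(i)) {
    substring(sequence, start[i], end[i + 1L])
  } else {
    character(0L)
  }
  list(fullyCleaved = fullyCleaved, specialMissed = specialMissed)
}

## Made-up sequence following the XXXXK / XXXXKEXXXR pattern from the text.
result <- specialMissedCleavages("AAAAKEAAAR")
print(result)
```

For this input the fully cleaved peptides are AAAAK and EAAAR, and the K followed by E additionally yields AAAAKEAAAR, so both forms would be available for quantitation.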

references:

  1. Does trypsin cut before proline?
  2. Large-Scale Quantitative Assessment of Different In-Solution Protein Digestion Protocols Reveals Superior Cleavage Efficiency of Tandem Lys-C/Trypsin Proteolysis over Trypsin Digestion.
  3. The importance of the digest: proteolysis and absolute quantification in proteomics. http://dx.doi.org/10.1016/j.ymeth.2011.05.005

move tests

testthat 0.8 comes with a new recommended structure for storing your tests. To
better meet CRAN recommended practices, testthat now recommends that you put
your tests in tests/testthat instead of inst/tests (this makes it
possible for users to choose whether or not to install tests). With this
new structure, you'll need to use test_check() instead of test_package()
in the test file (usually tests/testthat.R) that runs all testthat unit
tests.
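
For cleaver, the runner file described above would presumably look like the following sketch of tests/testthat.R, assuming the unit tests have been moved to tests/testthat/:

```r
## tests/testthat.R: runs all unit tests found in tests/testthat/
library("testthat")
library("cleaver")

test_check("cleaver")
```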

adding overhangs for synthetic peptides and QconCATs

It is quite clear now that 100% digestion efficiency with trypsin should not be assumed in proteomics workflows. Inefficient trypsin digestion also poses a very serious problem in absolute quantitation workflows that use isotopically labelled standards.

Isotopic standards are currently used as follows: the peptides to be quantified are synthesised in labelled form, and a known amount of each labelled peptide is spiked into the sample prior to its analysis by LC-MS. After acquisition, the amount of unlabelled peptide (and hence of its protein of origin) is computed as follows:

quantity_unlabelled = signal_unlabelled/signal_labelled * quantity_labelled
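
As a worked instance of this formula, a minimal R sketch with made-up numbers (the signal intensities and the spiked amount are purely illustrative, and quantifyUnlabelled() is a hypothetical helper):

```r
## Hypothetical helper implementing the ratio formula above.
quantifyUnlabelled <- function(signalUnlabelled, signalLabelled,
                               quantityLabelled) {
  signalUnlabelled / signalLabelled * quantityLabelled
}

## Illustrative values only: the unlabelled signal is half the labelled
## signal and 50 units of labelled peptide were spiked in.
quantity <- quantifyUnlabelled(2e6, 4e6, 50)
print(quantity)
```

With these values the result is 25 units. If part of the endogenous peptide is present in another form, the unlabelled signal drops and the computed quantity drops with it.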

Consider quantitation of the following peptide: VTTYFPSVNLR. Below is a piece of protein sequence it originates from:

GNIR.VTTYFPSVNLR.KSSQK

Note that to release the peptide from the protein, digestion must occur after R. However, this R is followed by K, which is expected to result in two dead-end products:

VTTYFPSVNLR and VTTYFPSVNLRK

As a result, the amount of the VTTYFPSVNLR peptide is no longer proportional to the protein amount, and if absolute quantitation is performed using this peptide only, the amount of protein will be underestimated (a specific example of this is given in ref1).

The most obvious way to counteract the problem is to ignore such peptides. However, this is usually not possible, because only a limited number of peptides suitable for quantitation is available for each protein. Thus the best solution is to mimic the cleavage site by adding 3 amino acids before and after the peptide.

However consider the following peptide:

QNGRLR.HFTIPSHR.ARAGR

If we add RLR to the N-terminus of the peptide sequence, the cleavage site again does not mimic what happens in the protein, because cleavage after the first R in the protein yields a dead-end product:

LR.HFTIPSHR

Hence the overhang needs to be extended by 3 aa before the RLR. However, this extension is not always possible, since there is a limit to the length of a peptide that can be synthesised (usually no longer than 20 aa). Additional parameters therefore need to be passed to the model to determine the optimal compromise.
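
The basic overhang step can be sketched in base R as follows. The function addOverhangs() is hypothetical and not part of cleaver; it only adds a fixed number of flanking residues and does not attempt the length/extension compromise discussed above. The protein fragments are the two examples from this issue with the dots removed.

```r
## Hypothetical helper: return the peptide together with up to n flanking
## residues on each side, taken from the protein sequence.
addOverhangs <- function(protein, peptide, n = 3L) {
  start <- as.integer(regexpr(peptide, protein, fixed = TRUE))
  if (start < 0L) {
    stop("peptide not found in protein")
  }
  end <- start + nchar(peptide) - 1L
  ## clip the overhangs at the protein termini
  substr(protein, max(1L, start - n), min(nchar(protein), end + n))
}

first <- addOverhangs("GNIRVTTYFPSVNLRKSSQK", "VTTYFPSVNLR")
second <- addOverhangs("QNGRLRHFTIPSHRARAGR", "HFTIPSHR")
print(c(first, second))

## Extending the N-terminal side further, as suggested for the RLR case,
## would need a larger n (here applied to both sides for simplicity).
print(addOverhangs("QNGRLRHFTIPSHRARAGR", "HFTIPSHR", n = 6L))
```

The first two calls give NIRVTTYFPSVNLRKSS and RLRHFTIPSHRARA. A real implementation would also have to check the result against the maximum synthesisable length.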

I will write out a detailed outline of the workflow if this functionality is to be added to cleaver.

references:

  1. Kito, Keiji, et al. "A synthetic protein approach toward accurate mass spectrometric quantification of component stoichiometry of multiprotein complexes." Journal of proteome research 6.2 (2007): 792-800.
