dosorio / peptides Goto Github PK

An R package to calculate indices and theoretical physicochemical properties of peptides and protein sequences.

R 88.22% C++ 10.93% C 0.85%

peptides protein-sequences cran calculate-indices bioinformatics qsar

peptides's Introduction

Hi🖖🏼! I'm Daniel Osorio, a Colombian computational biologist. I work as a Senior Scientific Consultant at QIAGEN Digital Insight. I earned my doctorate in Biomedical Sciences, specializing in Biomedical Genomics and Bioinformatics, from Texas A&M University. My research focuses on developing software for high-throughput single-cell RNA-seq data analysis, gene regulatory networks, data mining, metabolic modeling, and bioactive peptides.

peptides's Issues

Feature request: Travis CI build badges on README + volunteer

Hi @dosorio,

As you can see in my previous pull request, I incorrectly assumed that this repository did not use Travis CI.

To prevent such misunderstandings, I suggest adding a Travis CI build badge to the README.

Sure, I'd be happy to add it with a Pull Request 👍

P.S. I do know how to fix that build error as well 🤐

problem due to version?

Hi Dosorio,

The latest version, 2.4.3, is not available on CRAN.

I am using macrel, which depends on Peptides, and I get this error:

rpy2.rinterface_lib.embedded.RRuntimeError: Error: package or namespace load failed for ‘Peptides’:
 package ‘Peptides’ was installed before R 4.0.0: please re-install it

Re-installing r-peptides does not solve the problem (I used: conda install -c r r-peptides).

Any suggestions?
Kind Regards
Dany

Warnings for selenocysteine (U) and pyrrolysine (O)

Hi.
I saw that the mw function includes masses for selenocysteine (U) and pyrrolysine (O). However, the aaCheck function does not accept them. Therefore, mw("U") throws the warning "Sequence 1 has unrecognized amino acid types. Output value might be wrong calculated". It actually throws the warning twice.
Maybe we should suppress the warning for U and O, or use a more specific one, such as "sequence contains x atypical amino acids (selenocysteine/pyrrolysine)".

Include oxidations in hydrophobicity scales

Hi,

When calculating peptide hydrophobicity, it would be nice to be able to account for common modifications. For MS, oxidation of methionine is of particular interest. M[ox] dramatically reduces the hydrophobicity of a peptide, as mentioned in Eichacker et al., 2004:

When compared with the Eisenberg scale (39), a single oxidation of methionine would reduce the hydrophobicity from 0.64 to –0.76 (Fig. 6A). A double oxidation would decrease the hydrophobicity of the peptide to –2.0 (Fig. 6A). Such a drastic shift of one single amino acid may contribute to the physical properties of the hydrophobic transmembrane peptide in a way that the peptide becomes detectable.

However, I am not sure how we could provide this feature in the hydrophobicity function. Reading oxidation / dioxidation tags such as [+16] from the given sequence would be nice, but as far as I know none of the functions support this so far. A workaround would be additional integer parameters such as oxidated and dioxidated, with which the user specifies how many oxidations occur on the given peptide. For the hydrophobicity function, the exact location of the oxidation would be rather insignificant.

Also, I do not know whether all hydrophobicity scales define values for oxidized residues. Some research would be needed for a correct implementation.

What is your opinion about this?

Greetings, Florian
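
A minimal sketch of the tag-reading idea described above, in base R. The helper name parse_oxidation_tags is hypothetical and is not part of the Peptides API; it only assumes that oxidations are written as [+16] directly in the sequence string. It returns the bare sequence plus the number of tags, which could then be passed to something like the proposed oxidated parameter.

```r
# Hypothetical helper (not part of Peptides): split a tagged peptide string
# into the bare sequence and the number of "[+16]" oxidation tags.
parse_oxidation_tags <- function(seq) {
  tag <- "\\[\\+16\\]"                 # literal "[+16]", escaped for regex
  matches <- gregexpr(tag, seq)[[1]]   # match positions, or -1 if none
  n_ox <- if (matches[1] == -1) 0L else length(matches)
  list(sequence = gsub(tag, "", seq), oxidations = n_ox)
}

parsed <- parse_oxidation_tags("AM[+16]GHM[+16]K")
untagged <- parse_oxidation_tags("AGK")
```

Because the position of the oxidation is discarded, this matches the remark that the exact location is insignificant for a summed hydrophobicity value.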

Issue with mz

I tried to run the mz function today and got an error for any peptide not including "C":

> mz("AGHTTKILC")
[1] 500.7658
> mz("AGHTTKIL")
Error in table(unlist(strsplit(X, "")))[["C"]] : subscript out of bounds

Assuming I have the latest version (I did exit R and re-installed from CRAN), I would suggest the following change to the function's code:

function (seq, charge = 2, label = "none", aaShift = NULL, cysteins = 57.021464) {
    if (!is.numeric(charge) | length(charge) != 1) {
        stop("Charge must be given as an integer (typically between 1-4).")
    }
    mass <- mw(seq = seq, label = label, aaShift = aaShift, monoisotopic = TRUE)
    mass <- mass + nchar(gsub("[^C]", "", seq)) * cysteins #This is my suggested edit
    if (charge >= 0) {
        mass <- mass + charge * 1.007276
        mass <- mass/charge
    }
    return(mass)
}

I do not know whether this is the fastest approach, or whether it makes incorrect assumptions about acceptable input (I assumed upper-case, one-letter amino acid codes as a character string). At least it no longer fails by assuming the sequence contains at least one C, as the current version appears to do.
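
To illustrate why the original lookup fails, here is a small base R sketch. The helper name count_residue is hypothetical; it is one possible way to count a residue that returns 0 when the residue is absent, whereas indexing a table with [["C"]] raises an error because the table has no "C" entry at all.

```r
# Indexing a frequency table by a missing name fails with
# "subscript out of bounds".
lookup_failed <- inherits(
  try(table(unlist(strsplit("AGHTTKIL", "")))[["C"]], silent = TRUE),
  "try-error"
)

# Hypothetical helper: count occurrences of one residue in each sequence.
# Returns 0 (not an error) when the residue does not occur.
count_residue <- function(seq, residue = "C") {
  lengths(regmatches(seq, gregexpr(residue, seq, fixed = TRUE)))
}

counts <- count_residue(c("AGHTTKILC", "AGHTTKIL", "CCAC"))
```

This version is also vectorized over a character vector of sequences.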

Stable isotope labelling function for mw()

Dear Peptides-Team,

Recently, I wrote a small package to calculate the molecular weight of proteins that have undergone isotope labelling (e.g. SILAC or 15N), as is often used for protein analyses by LC-MS/MS. It can also calculate the m/z of peptides, similar to one feature of the EMBOSS tool peptide mass.

I am planning to integrate these functions into the Peptides package, if this is of interest to you. I briefly discussed with @dosorio that the additional functionality should be integrated directly into the mw() function. The m/z is calculated in a separate function, though.

If you disagree, please comment. I will start working on a pull request shortly.

Can this package support analysing more than one peptide concurrently?

Thank you Dosorio for the useful package.

Within the existing R package, is there a way to input a text file of peptide sequences (perhaps in FASTA format) for analysis, and to output a file containing the results together with the run parameters and/or scales used for that run?

It would be great to take advantage of the peptide mass analysis in the Peptides package.

PhD Student (University of Nottingham)
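
One way to approach this with base R alone is sketched below. Both helper names (read_fasta_simple, summarise_peptides) are hypothetical, the parser assumes a well-formed FASTA file in which every header is followed by at least one sequence line, and the property function is a placeholder: here it is just the sequence length, but any function that maps one sequence to one number could be passed instead.

```r
# Hypothetical minimal FASTA reader: returns a named character vector.
# Assumes every ">" header is followed by at least one sequence line.
read_fasta_simple <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                       # record index per line
  chunks <- split(lines[!is_header], record[!is_header])
  seqs <- vapply(chunks, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Hypothetical wrapper: apply one numeric property to every sequence.
summarise_peptides <- function(seqs, property, property_name = "value") {
  out <- data.frame(id = names(seqs), sequence = unname(seqs),
                    stringsAsFactors = FALSE)
  out[[property_name]] <- vapply(out$sequence, property, numeric(1),
                                 USE.NAMES = FALSE)
  out
}

# Example input: the second record is wrapped over two lines.
fasta <- tempfile(fileext = ".fasta")
writeLines(c(">pep1", "AGHTTKILC", ">pep2", "KWKLFKK", "IGIGKFLHSAKKF"), fasta)

seqs <- read_fasta_simple(fasta)
result <- summarise_peptides(seqs, function(s) as.numeric(nchar(s)), "length")

# Write the table next to whatever run parameters you want to record.
write.csv(result, tempfile(fileext = ".csv"), row.names = FALSE)
```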

Units of charge function

Hello,

I am currently using the package for my doctorate and would like to know the unit of the value returned by the 'charge' function. Is it the coulomb?

Thank you very much in advance!

MW function

The calculation does not account for peptide bonds, so it overestimates the molecular weight.
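
The size of the effect described here can be sketched in base R. This is not the package's implementation: the two residue masses and the water mass below are commonly tabulated average values used purely for illustration, and both function names are hypothetical. Each peptide bond releases one water molecule, so a peptide of n residues weighs (n - 1) water masses less than the sum of its free amino acids.

```r
# Illustrative average masses in Da (assumed values, not read from Peptides).
residue_mass <- c(G = 57.0519, A = 71.0788)  # amino acid residues
water <- 18.01524

# Peptide mass: residues plus one water for the free termini.
peptide_mass <- function(seq) {
  aa <- strsplit(seq, "")[[1]]
  sum(residue_mass[aa]) + water
}

# Naive sum of free amino acids: one water per residue (overestimates).
free_amino_acid_sum <- function(seq) {
  aa <- strsplit(seq, "")[[1]]
  sum(residue_mass[aa] + water)
}

# For the dipeptide GA the two results differ by exactly one water mass.
overestimate <- free_amino_acid_sum("GA") - peptide_mass("GA")
```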

feature request

Hello,
I come from the world of small molecules and occasionally deal with peptides. I have two requests for the very nice Peptides package. Given an AA sequence I can retrieve the MW of the peptide, but it would be better still to get (1) the molecular formula and (2) some digital representation of the structure. SMILES would probably be easiest, though InChI would be an option too. I would guess that the formula is much easier than the structure, but I figured I might as well place the request for both! Thanks for putting this together!
Corey

How to vectorise aacomp

I was looking into the aacomp function and how to vectorize it. The big problem is that the output format doesn't scale well to examining lots of peptides. Since the output is currently a matrix, you could extend it to a three-dimensional array when the input is a character vector of sequences, but these are usually rather hard to read.

My preference would be to have a data frame with a column for the input sequence, and columns for each of the composition groups, letting the user choose whether they care about counts or percentages.

Here's my suggested implementation:

aacomp2 <- function(seq, metric = c("count", "percentage")){
  metric <- match.arg(metric)
  # Strip whitespace (stands in for the internal .remove_spaces helper)
  seq <- gsub("[[:space:]]+", "", seq)
  # Classify amino acids in a particular class and sum the absolute frequencies
  regexes <- c(
    Tiny = "[ACGST]",
    Small = "[ABCDGNPSTV]",
    Aliphatic = "[AILV]",
    Aromatic = "[FHWY]",
    NonPolar = "[ACFGILMPVWY]",
    Polar = "[DEHKNQRSTZ]",
    Charged = "[BDEHKRZ]",
    Basic = "[HKR]",
    Acidic = "[BDEZ]"
  )
  # One count vector per class; requires the stringi package
  y <- lapply(
    regexes,
    stringi::stri_count_regex,
    str = seq
  )
  if(metric == "percentage")
  {
    # Convert counts to percentages of the sequence length
    n_chars <- nchar(seq)
    y <- lapply(y, function(yi) round(100 * yi / n_chars, 3))
  }
  # One row per input sequence, one column per class
  data.frame(seq = seq, y)
}

Usage is, e.g.:

aacomp2("KWKLFKKIGIGKFLHSAKKFX")
##                     seq Tiny Small Aliphatic Aromatic NonPolar Polar Charged Basic Acidic
## 1 KWKLFKKIGIGKFLHSAKKFX    4     4         5        5       11     9       8     8      0

aacomp2("KWKLFKKIGIGKFLHSAKKFX", "percentage")
##                     seq   Tiny  Small Aliphatic Aromatic NonPolar  Polar Charged  Basic Acidic
## 1 KWKLFKKIGIGKFLHSAKKFX 19.048 19.048     23.81    23.81   52.381 42.857  38.095 38.095      0

The only downside is that it will break code by other users relying on the structure of the return from aacomp. If you think this is a problem, we may also need to provide a legacy mode that uses the existing behaviour.

Average ion mass

Dear Peptides Team,

I noticed that different online sources disagree about the average masses of the amino acids.

Your package currently uses the masses given by Expasy, which match those on Wikipedia.

However, the UWPR reports different average masses, and Mascot reports the same values as the UWPR.

Does anyone know where these differences come from or where to find a definite source for the masses? @dosorio proposed "adding all the possible scales as arguments of the mw function would be the best."

mz function broken

Hi @dosorio .
Unfortunately, your changes on the mz function broke it for peptides without cysteines:

> mz("AYNVTQAFGR")
Error in table(unlist(strsplit(X, "")))[["C"]] : 
  subscript out of bounds
> mz("AYNVTQAFGRC")
[1] 643.8009

read.xvg arbitrarily cuts off column names after 6 columns

Using read.xvg() on xvg files with more than 6 columns of data yields NA for the column names beyond the sixth. The parsing loop (lines 18–20) appears to use nmax=min(P)+1 inappropriately to set the upper cutoff for scraping out variable names. nmax sets the maximum number of data values for scan() to read, not the maximum number of lines to go through. And even if it did limit lines, min(P)+1 would only allow 3 headers to be scanned (the first was initiated by min(P)-1).

The scan() call can be amended to any of the following to deal properly with additional columns (instead of cutting off prematurely):
scan(file,skip=min(P)-1,nmax=P1*4,what="",quiet=TRUE)[4*i]
scan(file,skip=min(P)-1,nlines=P1,what="",quiet=TRUE)[4*i]
scan(file,skip=min(P)-1,nlines=diff(range(P))+1,what="",quiet=TRUE)[4*i] (redundant to above)
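
The difference between the two scan() arguments can be seen in a small, self-contained sketch. The file contents below are made up for the demonstration and have nothing to do with the xvg format: nmax caps the number of values read, while nlines caps the number of lines read.

```r
# Three lines with four whitespace-separated values each.
tmp <- tempfile()
writeLines(c("a b c d", "e f g h", "i j k l"), tmp)

# nmax limits the number of VALUES: reading stops after 5 items,
# part-way through the second line.
by_items <- scan(tmp, what = "", nmax = 5, quiet = TRUE)

# nlines limits the number of LINES: both complete lines are read.
by_lines <- scan(tmp, what = "", nlines = 2, quiet = TRUE)
```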

Compilation fails

Hello Daniel,

I seem to be having an issue on Windows with compiling the package.

The error log is as follows:

> devtools::install()
√  checking for file 'C:\Users\admin\git\Peptides/DESCRIPTION' (685ms)
-  preparing 'Peptides': (2.1s)
√  checking DESCRIPTION meta-information ... 
-  cleaning src
-  checking for LF line-endings in source and make files and shell scripts (604ms)
-  checking for empty or unneeded directories
-  looking to see if a 'data/datalist' file should be added
-  building 'Peptides_2.4.1.tar.gz'
   
Running "C:/PROGRA~1/R/R-36~1.0/bin/x64/Rcmd.exe" INSTALL "C:\Users\admin\AppData\Local\Temp\RtmpU32l4u/Peptides_2.4.1.tar.gz" \
  --install-tests 
* installing to library 'C:/Users/admin/Documents/R/win-library/3.6'
* installing *source* package 'Peptides' ...
** using staged installation
** libs
-
*** arch - i386
C:/Rtools/mingw_32/bin/g++  -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG  -I"C:/Users/admin/Documents/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c RcppExports.cpp -o RcppExports.o
C:/Rtools/mingw_32/bin/g++  -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG  -I"C:/Users/admin/Documents/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c charge_pI.cpp -o charge_pI.o
charge_pI.cpp: In function 'int pKscales(std::string)':
charge_pI.cpp:10:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < sizeof(scales)/sizeof(scales[0]); i++){
                     ^
charge_pI.cpp:15:10: warning: 'sScale' may be used uninitialized in this function [-Wmaybe-uninitialized]
   return sScale;
          ^
C:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG  -I"C:/Users/admin/Documents/R/win-library/3.6/Rcpp/include"        -O3 -Wall  -std=gnu99 -mtune=generic -c init.c -o init.o
C:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o Peptides.dll tmp.def RcppExports.o charge_pI.o init.o -LC:/PROGRA~1/R/R-36~1.0/bin/i386 -lR
:init.c:(.rdata+0x64): undefined reference to `Peptides_absoluteCharge'
init.o:init.c:(.rdata+0x70): undefined reference to `Peptides_chargeList'
init.o:init.c:(.rdata+0x7c): undefined reference to `Peptides_RcppExport_registerCCallable'
collect2.exe: error: ld returned 1 exit status
no DLL was created
ERROR: compilation failed for package 'Peptides'
* removing 'C:/Users/admin/Documents/R/win-library/3.6/Peptides'
* restoring previous 'C:/Users/admin/Documents/R/win-library/3.6/Peptides'

I think the relevant lines are:

:init.c:(.rdata+0x64): undefined reference to `Peptides_absoluteCharge'
init.o:init.c:(.rdata+0x70): undefined reference to `Peptides_chargeList'
init.o:init.c:(.rdata+0x7c): undefined reference to `Peptides_RcppExport_registerCCallable'

Is there a particular compiler that is required that I am not aware of?

Unfortunately, my knowledge of C/C++ is not enough to help with further debugging.

Kindest wishes,
Sebastian

Is the descriptor calculation a simple sum of each amino acid?

I confirmed that the charge etc. are calculated by simple sum

What is the meaning of Blosum1-10?

Hi!

Thanks for such a wonderful package.
I have a question about the function blosumIndices. After inputting a peptide sequence, I get 10 values named Blosum1-10. What do these values mean?

I hope you can answer; it would help a lot!

Thanks!

Follow-up on the expansion of amino acids to include non-naturals.

Hi @dosorio

Thanks for such a wonderful package.

I'm working on generating a lot of peptides, mostly with non-natural amino acids. I was wondering whether the list of amino acids could be expanded to include new amino acids and their SMILES, so that peptides containing non-natural residues can be passed to the aaSMILES function to generate SMILES for them.
I can see that one-letter amino acid names may become problematic.
Is there a way this can be added, maybe by using the 3-letter amino acid code rather than the 1-letter code?

It would help a lot!

Thanks!

Charge function

The charge calculated at pH X does not correspond to the isoelectric point.
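
The expected relationship can be checked with a short base R sketch: at the isoelectric point the net charge should be approximately zero, provided the charge and the pI are computed with the same pKa scale. The pKa values and the function names below are assumptions for illustration only and are not taken from the package. If the charge and the pI are computed with different pK scales, a nonzero charge at the reported pI is expected.

```r
# Illustrative pKa values (assumed; real scales differ from one another).
pk_basic  <- c(Nterm = 8.6, K = 10.8, R = 12.5, H = 6.5)
pk_acidic <- c(Cterm = 3.6, D = 3.9, E = 4.1, C = 8.5, Y = 10.1)

# Net charge from the Henderson-Hasselbalch equation.
net_charge <- function(seq, pH) {
  aa <- strsplit(seq, "")[[1]]
  basic  <- c(pk_basic[["Nterm"]],  pk_basic[aa[aa %in% names(pk_basic)]])
  acidic <- c(pk_acidic[["Cterm"]], pk_acidic[aa[aa %in% names(pk_acidic)]])
  sum(1 / (1 + 10^(pH - basic))) - sum(1 / (1 + 10^(acidic - pH)))
}

# The pI is the pH at which the net charge crosses zero.
isoelectric_point <- function(seq) {
  uniroot(function(pH) net_charge(seq, pH), c(0, 14), tol = 1e-8)$root
}

pep <- "KWKLFKKIGIGKFLHSAKKF"
pi_value <- isoelectric_point(pep)
charge_at_pi <- net_charge(pep, pi_value)
```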

peptide modifications

Hiya!

Thanks for such a wonderful package. I'm working with a lot of peptides that I order from GenScript. A lot of these have modifications or are non-natural amino acids (see here).

Is there any possibility to include these in the package? It would help a lot!

Thanks!

Assigning peptide N- or C-termini labels when calculating mw/mz

Hey there!

Awesome package, thank you!

I was wondering whether aaShift within mz/mw could be extended to label peptide termini in addition to specific amino acids?

This would be very useful, for example, when adding labels to the termini of peptides.

TMT tags would even be cumulative between the N-terminus and specific amino acids such as K (an N-terminal K would carry two TMT labels, one on the amino group of the free peptide terminus and one on the free amino group of the side chain).

Cheers,
Tobias
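
The cumulative counting described above could look roughly like this in base R. Both helper names are hypothetical and are not part of aaShift or any Peptides function, and the label mass is deliberately left as an argument for the caller to supply.

```r
# Hypothetical helper: number of label sites on one peptide, counting the
# free N-terminus (if labelled) plus every occurrence of the target residues.
count_label_sites <- function(seq, residues = "K", n_term = TRUE) {
  aa <- strsplit(seq, "")[[1]]
  sum(aa %in% residues) + as.integer(n_term)
}

# Hypothetical helper: total mass shift for a caller-supplied label mass.
label_mass_shift <- function(seq, label_mass, ...) {
  count_label_sites(seq, ...) * label_mass
}

# An N-terminal K carries two labels: one on the terminus, one on the side chain.
sites_kag  <- count_label_sites("KAG")
sites_kagk <- count_label_sites("KAGK")
sites_no_terminus <- count_label_sites("KAGK", n_term = FALSE)
```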

Feature request: RStudio project + volunteer

Hi @dosorio,

Currently, RStudio is the most used R IDE.

I suggest adding an RStudio project file to the repo.

Sure, I volunteer to do so 👍

Feature request: get list of amino acids + volunteer

Hi @dosorio,

For me, Peptides has been a very helpful package. Thanks a lot!

Sadly, I am missing something simple (and in my context useful) in your package: a way to get a character vector of all amino acids.

I even see evidence in your code that you could also use it, for example here I see:

aaCheck <- function(seq){
  # ...
  check <- unlist(lapply(seq,function(sequence){
    !all(sequence%in%c("A" ,"C" ,"D" ,"E" ,"F" ,"G" ,"H" ,"I" ,"K" ,"L" ,"M" ,"N" ,"P" ,"Q" ,"R" ,"S" ,"T" ,"V" ,"W" ,"Y", "-"))
  }))
  # ...
}

I suggest adding a function to get all amino acids. Because it is so easy, I volunteer to write and test the function: all I need is your favorite function name for it.

Sure, it is easy to write such a function, but I feel the Peptides package is a natural place to put it.

Do you agree this simple function would be a worthy addition to Peptides?
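
A possible shape for such a function is sketched below. The names aa_letters and is_valid_sequence are placeholders, since the issue leaves the naming to the maintainer; the letters themselves are the ones listed in the aaCheck excerpt above.

```r
# Hypothetical function name: the 20 standard one-letter amino acid codes,
# optionally with the gap character that aaCheck also accepts.
aa_letters <- function(include_gap = FALSE) {
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  if (include_gap) c(aa, "-") else aa
}

# Example of reuse: check whether each sequence contains only known letters.
is_valid_sequence <- function(seq) {
  vapply(strsplit(seq, ""),
         function(s) all(s %in% aa_letters(include_gap = TRUE)),
         logical(1))
}

valid <- is_valid_sequence(c("ACDE", "ABX", "AC-DE"))
```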

Can we calculate the cyclic peptide?

I don't know how to write a cyclic peptide as a string, but can this library handle it?
