Code Monkey home page Code Monkey logo

universalmotif's Introduction

Bioc build status Bioc

universalmotif

This package allows for importing most common motif types into R for use by functions provided by other Bioconductor motif-related packages. Motifs can be exported into most major motif formats from various classes as defined by other Bioconductor packages. Furthermore, this package allows for easy manipulation of motifs, such as creation, trimming, shuffling, P-value calculations, filtering, type conversion, reverse complementation, alphabet switching, random motif site generation, and comparison. Alongside are also included functions for interacting with sequences, such as motif scanning and enrichment, as well as sequence creation and shuffling functions. Finally, this package implements higher-order motifs, allowing for more accurate sequence scanning and motif enrichment.

Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
BiocManager::install("universalmotif")
if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
BiocManager::install("bjmt/universalmotif")

Note: building the vignettes when installing from source is not recommended, unless you don't mind waiting an hour for the necessary dependencies to install.

Error when installing from source

If you trying to install the package from source and are getting compiler errors similar to these issues [1, 2, 3], then update your C++ compiler. This is an issue regarding incompatibilities between older compilers and the C++11 lambda functions from the RcppThread package, which is used by the universalmotif package.

Brief overview

All of the functions within the universalmotif package are fairly well documented. You can access the documentation from within R, reading the Bioconductor PDF, or browsing the rdrr.io website (the latter is not always up to date). Additionally, several vignettes come with the package, which you can access from within R or on the Bioconductor website:

You can also look through the slides of my Bioc2021 presentation, which goes over some basics of motif representations, scanning, and motif comparison.

A few key functions are also explored below.

The universalmotif motif class and import/export utilities

The universalmotif class is used to store the motif matrix itself, as well as other basic information such as alphabet, background frequencies, strand, and various other metadata slots. There are a number of ways of getting universalmotif class motifs:

  • Manual motif creation with create_motif() using one of several possible input types:
    • Consensus sequence
    • Sequence sites
    • Numeric matrix
    • No input: generate random motifs of any length

universalmotif class motifs are highly interoperable with other motif formats:

  • Import/export from/to several supported formats:
    • CIS-BP: read_cisbp()
    • HOMER: read_homer(), write_homer()
    • JASPAR: read_jaspar(), write_jaspar()
    • MEME: read_meme(), write_meme()
    • TRANSFAC: read_transfac(), write_transfac()
    • UNIPROBE: read_uniprobe()
    • Generic matrices: read_matrix(), write_matrix()
  • Conversion from/to several compatible Bioconductor package motif classes using convert_motifs() (some formats cannot go both ways; see the documentation for details):
    • TFBSTools: PFMatrix, PWMatrix, ICMatrix, PFMatrixList, PWMatrixList, ICMatrixList, TFFMFirst
    • MotifDb: MotifList
    • seqLogo: pwm
    • motifStack: pcm, pfm
    • PWMEnrich: PWM
    • motifRG: Motif
    • Biostrings: PWM
    • rGADEM: motif
library(universalmotif)

create_motif()
#>
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   11.46
#>         Consensus:   YGTGMMMRGA
#>
#>      Y G    T    G    M    M    M    R    G    A
#> A 0.17 0 0.00 0.04 0.58 0.62 0.29 0.47 0.08 0.77
#> C 0.36 0 0.01 0.00 0.41 0.36 0.68 0.16 0.05 0.00
#> G 0.00 1 0.03 0.95 0.00 0.00 0.04 0.28 0.86 0.23
#> T 0.47 0 0.96 0.02 0.00 0.03 0.00 0.09 0.00 0.00

See ?universalmotif for a list of available metadata slots. Most slots can be accessed using square brackets, e.g. MotifObject["motif"] accesses the raw numeric matrix. You can also dump the contents of all user-facing motif slots at once into a list, e.g. MotifObject[].

Sequence creation, shuffling and background calculation

An important aspect of motif scanning and enrichment is to compare the results with those from a set of random or background sequences. For this, two functions are provided:

  • create_sequences(): create sequences of any alphabet, with optional desired background frequencies
  • shuffle_sequences(): shuffle a set of sequences, preserving any size k-let
library(universalmotif)

seqs <- create_sequences()

seqs
#>   A DNAStringSet instance of length 100
#>       width seq
#>   [1]   100 AGTACGTTCGCATGGCAGGCATTATTTGCGCTG...TATCAGCCTAGAAGCAGGCGTACCAAGGTCTA
#>   [2]   100 AATATCGGGCGCGAAGCCCGATGCGTGCTCGGA...GATGCAGTTCAAACGAAATCTCGTAAACGTGA
#>   [3]   100 AGTACAGCAATGGGGACATAAGCCGTCTCATCG...CATAGTTCTCGAAATATGAATCTCCAGTCCCA
#>   [4]   100 CAGATGCACTATCACCGTGCCGAGCTCGGTAAC...AATCGCATTGAACTAACAGGGGAGCAAGATAA
#>   [5]   100 CGGCCCCTGGGACGTTGGATCCAGATAAAGCTT...TATGTTCCTTGCCGGAATACGGCACATATCTC
#>   ...   ... ...
#>  [96]   100 CGGTGCAAAATGTGCCGCACACGGTAGTGCGGG...TTACACGCGTCTTTCGGAGAATGAGCTCGGCA
#>  [97]   100 CAGTTAATCTATTAATGAGTCACTTAGGATTCC...GTTGCTTGGATATGGGAGAGAATGGCCAGTAA
#>  [98]   100 GGGTCGTTGGCAGGGATGCACACAGACACGAAT...GTTTGCAAGACAACAGTAGCTAATTGTGCCAA
#>  [99]   100 GCCTTCGGACGCCAAGTCTGCAAACAATTCCTC...CTTCTACGCCAAAACTCTTATCCCTGGCATTC
#> [100]   100 GTCACAGCCAAGCTTTAAGTCTTCCAACCAGGA...ATTGTGGACGGAAGGTACCGTCGTAGATTCGC

seqs.shuffled <- shuffle_sequences(seqs, k = 3)

Additionally, if you are interested in the detailed k-mer content of you sequences you can use get_bkg(). It can be used to calculate sequence background for any size k-mer, and for any sequence alphabet. Results can be shown for individual sequences or merged together. There is also an option to calculate these results in any size windows (with any size overlap between windows) across the sequences.

library(universalmotif)

data(ArabidopsisPromoters)

get_bkg(ArabidopsisPromoters, merge.res = FALSE)
#> DataFrame with 4200 rows and 4 columns
#>         sequence        klet     count probability
#>      <character> <character> <integer>   <numeric>
#> 1      AT4G28150           A       318       0.318
#> 2      AT1G19380           A       309       0.309
#> 3      AT4G19520           A       325       0.325
#> 4      AT1G03850           A       338       0.338
#> 5      AT5G01810           A       317       0.317
#> ...          ...         ...       ...         ...
#> 4196   AT5G22690         TTT        36   0.0360721
#> 4197   AT1G05670         TTT        43   0.0430862
#> 4198   AT1G06160         TTT        56   0.0561122
#> 4199   AT5G24660         TTT        43   0.0430862
#> 4200   AT3G19200         TTT        34   0.0340681

get_bkg(ArabidopsisPromoters, window = TRUE)
#> DataFrame with 840 rows and 5 columns
#>         start      stop        klet     count probability
#>     <numeric> <numeric> <character> <integer>   <numeric>
#> 1           1       100           A      1604      0.3208
#> 2         101       200           A      1636      0.3272
#> 3         201       300           A      1773      0.3546
#> 4         301       400           A      1791      0.3582
#> 5         401       500           A      1716      0.3432
#> ...       ...       ...         ...       ...         ...
#> 836       501       600         TTT       255   0.0520408
#> 837       601       700         TTT       269   0.0548980
#> 838       701       800         TTT       233   0.0475510
#> 839       801       900         TTT       255   0.0520408
#> 840       901      1000         TTT       271   0.0553061

Sequence scanning and higher order motifs

The universalmotif package provides the scan_sequences() function to quickly scan a set of input sequences for motif hits. Additionally, the add_multifreq() function can be used to generate higher order motifs. These can also be used to scan sequences with higher accuracy. By default scan_sequences() calculates a threshold cutoff from a P-value, though this can be changed to a manual logodds threshold.

library(universalmotif)
library(Biostrings)
data(ArabidopsisPromoters)

seqs <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3))
motif <- create_motif(seqs, pseudocount = 1)

scan_sequences(motif, ArabidopsisPromoters)
#> DataFrame with 53 rows and 14 columns
#>           motif   motif.i    sequence     start      stop     score       match
#>     <character> <integer> <character> <integer> <integer> <numeric> <character>
#> 1         motif         1   AT1G03850       203       209      9.08     CTAATCC
#> 2         motif         1   AT1G06160       956       962      9.08     CTAATCC
#> 3         motif         1   AT1G07490       472       478      9.08     CTTAACC
#> 4         motif         1   AT1G07490       936       942      9.08     CATTTCC
#> 5         motif         1   AT1G19380       139       145      9.08     CTTATCC
#> ...         ...       ...         ...       ...       ...       ...         ...
#> 49        motif         1   AT5G20200       430       436      9.08     CAATTCC
#> 50        motif         1   AT5G22690        81        87      9.08     CAATACC
#> 51        motif         1   AT5G22690       362       368      9.08     CAAATCC
#> 52        motif         1   AT5G58430       332       338      9.08     CATAACC
#> 53        motif         1   AT5G58430       343       349      9.08     CAAATCC
#>     thresh.score min.score max.score score.pct      strand      pvalue    qvalue
#>        <numeric> <numeric> <numeric> <numeric> <character>   <numeric> <numeric>
#> 1           9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 2           9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 3           9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 4           9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 5           9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> ...          ...       ...       ...       ...         ...         ...       ...
#> 49          9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 50          9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 51          9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 52          9.08   -19.649      9.08       100           + 0.000976562  0.915758
#> 53          9.08   -19.649      9.08       100           + 0.000976562  0.915758

motif.k2 <- add_multifreq(motif, seqs, add.k = 2)
scan_sequences(motif.k2, ArabidopsisPromoters, use.freq = 2, threshold = 1e-6)
#> DataFrame with 8 rows and 14 columns
#>         motif   motif.i    sequence     start      stop     score       match
#>   <character> <integer> <character> <integer> <integer> <numeric> <character>
#> 1       motif         1   AT1G19510       960       965    17.827      CTTTTC
#> 2       motif         1   AT1G49840       959       964    17.827      CTTTTC
#> 3       motif         1   AT1G77210       184       189    17.827      CAAAAC
#> 4       motif         1   AT1G77210       954       959    17.827      CAAAAC
#> 5       motif         1   AT2G37950       751       756    17.827      CAAAAC
#> 6       motif         1   AT3G57640       917       922    17.827      CTTTTC
#> 7       motif         1   AT4G12690       938       943    17.827      CAAAAC
#> 8       motif         1   AT4G14365       977       982    17.827      CTTTTC
#>   thresh.score min.score max.score score.pct      strand      pvalue    qvalue
#>      <numeric> <numeric> <numeric> <numeric> <character>   <numeric> <numeric>
#> 1       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 2       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 3       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 4       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 5       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 6       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 7       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494
#> 8       17.827   -16.842    17.827       100           + 1.90735e-06 0.0118494

Note the differences between the matching sequences of regular scanning versus higher order scanning.

Motif comparison, merging and viewing

A commonly performed task after de novo motif discovery is to check how closely it might resemble known motifs. This can be performed using the highly customizable compare_motifs() with one of several available metrics. Different motifs can also be merged with merge_motifs(). In addition to motif visualization, view_motifs() can also be used to examine the top-scoring alignment chosen by compare_motifs() and merge_motifs().

library(universalmotif)

new.motif <- create_motif("CGCGAAAAAA", name = "New motif")
old.motif <- create_motif("TATATTTTTT", name = "Old motif")

Using very strict alignment parameters, such as no overhangs:

compare_motifs(c(new.motif, old.motif), method = "PCC", min.overlap = 10)[2]
#> [1] 0.2

merged.motif <- merge_motifs(c(new.motif, old.motif), method = "PCC",
    new.name = "Merged motif", min.overlap = 10)

view_motifs(c(new.motif, old.motif, merged.motif), method = "PCC",
    min.overlap = 10)

After relaxing the alignment parameters:

compare_motifs(c(new.motif, old.motif), method = "PCC", min.overlap = 5)[2]
#> [1] 1

merged.motif <- merge_motifs(c(new.motif, old.motif), method = "PCC",
    new.name = "Merged motif", min.overlap = 5)

view_motifs(c(new.motif, old.motif, merged.motif), method = "PCC",
    min.overlap = 5)

By default compare_motifs() returns a numeric matrix, meaning the output from comparisons between large numbers of motifs can be easily used to generate heatmaps or dendrograms.

universalmotif's People

Contributors

bjmt avatar hpages avatar jwokaty avatar nturaga avatar snystrom avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

universalmotif's Issues

`create_motif` makes incorrect motif for amino acid sequences

What happens

Given a list of sequences (as AAStringSet), create_motif returns an obviously wrong PWM and consensus. It appears that there are some issues with inconsistent ordering of amino acid label values.

What I suspect might be causing the problem

create_motif creates a matrix with row labels from Biostring's AA_STANDARD. This is a list of single-letter amino acid codes that are in the order:
"A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y" "V"

Later manipulations of this matrix seem to expect the order to be alphabetical.

How to reproduce

library(universalmotif)
library(Biostrings)

sequences <- AAStringSet(c("VTTDLQVKV", "STSDLLTLR", "TSLHLLVLR", "QALELLPRL", "LTDTLVSKL", 
  "TSLHLVLRL", "TSLRLLTSL", "LSTPVLRFT", "APEEHPVLL", "GSSDFLVKL", 
  "VTFLLPAGW", "LTSELLTHL", "TSSSLLLLR", "LSTEVNPKL", "QSLPTKETL", 
  "LLDPHVVLL", "SGLVLKVLL", "LTAHVEPLL", "STVKVLLRL", "FLDTVLLSW", 
  "LSKALVAYY", "KASSLVPKL", "LTADLARVL", "SGTDRQVTL", "TFDVALSPR", 
  "EDFTLLVNL", "FDDVAVVTF", "SGAYLKVSL", "LWDLSLLTR", "LTTKALYRN", 
  "GVAPLQVVK", "FFDPVTLHL", "LVSALQLLL", "TESKYYVTL", "LFDLFRFGF", 
  "LSVPLFKQF", "KRTLLDVVY", "KSFEAPLLK", "TTTPQQTKL", "SAADLPLNL", 
  "VSSKLLLVL", "QSLPTKETL", "VTLFKVAAP", "LTAHVEPLL", "LDVRYLLDL", 
  "TTGTLLKTL", "MLLDVYLTL", "SGLVVLKLL", "KSTDVFTTF", "LTAQHKLMA", 
  "HFDLLLRVN", "KALDSSKTF", "YNDEALLLR", "KSLTLTPQL", "YTRYGPKAF", 
  "MVAKKPNLL", "YQPDFYFEF", "KDLLMVPTF", "FSLPWRSST", "LPDSSPRTL", 
  "SMAALFVLL", "PELEVKVTV", "KTPVKVPVL", "LKLLLGLLL", "VLTTKLLVL", 
  "LSQRKSTSL", "KTTPDVLFV", "LEELSKYLF", "QSLPLFVQL", "KDTKTLVLL", 
  "HGFFLPEKL", "KLYYQEFKK", "HSLTEDVTL", "ASSTNLLHL", "NDAYLVQGL", 
  "STLLKFEAA", "HSAELLAEL", "PDLLTKLTF", "TFTKTQETL", "LSGRLLTVL", 
  "KPEVVFLLL", "KGFVGSFLV", "KAVDTSKTF", "FDDTTFGTF", "VQVVLMLLL", 
  "VALAKSLYY", "TAHDLLAEL", "KAAKKAPLN", "SYVKLLLSY", "LPLFVSLDL", 
  "VNFLVLVRT", "FLKAPLLFL", "TLPHLSESF", "TPHDPTVPL", "VDGKTLVNV", 
  "KLTSEVLNL", "EPFVLPLTW", "VTDLHKTSL", "KQKWLALLK", "MVAKKPNLL", 
  "SLRNVKVTL", "QNTLAVPEL", "PSPFAALVH", "HTFWGVVFF", "KSDVFLTEL", 
  "FTDARAYTT", "LTERFTLVF", "KGTSTTHLL", "VGNLRALVR", "KEASLQLVL", 
  "PVTTKPVTL", "KNASLYLLV", "HTAELVLVL", "TYDLQESNV", "TTATQVLLL", 
  "KDGLFWVLV", "VSTGLVKLR", "KANEKLAVK", "LVSVQVVLV", "VAKVNAYTF", 
  "KAAELQTGL", "QVKFAGVKL", "FDDDSKLFW", "AVRMVGLQL", "LSNVAYPVL", 
  "MFDDTELLF", "KLTLTEVEP", "YRSLGPALR", "LLARASLLL", "LTNSSTVTL", 
  "GTDLASFNL", "LCNAKLYLF", "KLEDFAFTF", "HTNALQTLL", "KVDSVYYLF", 
  "VVKAKVNAP", "TTLLKEVEP", "ASLPRSVLF", "WLAWSTFGE", "LTDGYKLTL", 
  "QGYEKLVEV", "KDGFTLFYF", "LWPLLAVAL", "LDPLSVKTF", "KVQQYAVKL", 
  "TDELSPHLL", "VDFLLATWF", "CYGRSVLNY", "GSTERNVTL", "KDEVYYVKL", 
  "KNKAAVLQL", "TDDYMELLF", "TTTLAKVEV", "FLGKALFFL", "LPDMSQPLW", 
  "KTRTEVSQY", "TEPEYLTEY", "HATTQNVLL", "KSDVFTLEL"))

Compare the output of create_motif with that of consensusMatrix:

create_motif(sequences, alphabet="AA", type="PWM")
consensusMatrix(sequences)

Note that I'm using v1.4.0 of universalmotif, but I don't think this issue has been addressed subsequently.

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.0 S4Vectors_0.24.0 BiocGenerics_0.32.0 universalmotif_1.4.0

loaded via a namespace (and not attached):
[1] treeio_1.10.0 tidyselect_0.2.5 remotes_2.1.0 purrr_0.3.3 lattice_0.20-38 colorspace_1.4-1
[7] vctrs_0.2.0 yaml_2.2.0 rlang_0.4.1 pkgbuild_1.0.6 pillar_1.4.2 glue_1.3.1
[13] withr_2.1.2 rvcheck_0.1.6 lifecycle_0.1.0 stringr_1.4.0 ggseqlogo_0.1 zlibbioc_1.32.0
[19] munsell_0.5.0 gtable_0.3.0 callr_3.3.2 ps_1.3.0 gbRd_0.4-11 curl_4.2
[25] Rcpp_1.0.2 scales_1.0.0 backports_1.1.5 BiocManager_1.30.9 jsonlite_1.6 ggplot2_3.2.1
[31] packrat_0.5.0 stringi_1.4.3 processx_3.4.1 dplyr_0.8.3 grid_3.6.1 rprojroot_1.3-2
[37] bibtex_0.4.2 ggtree_2.0.0 Rdpack_0.11-0 cli_1.1.0 tools_3.6.1 magrittr_1.5
[43] lazyeval_0.2.2 tibble_2.1.3 crayon_1.3.4 ape_5.3 tidyr_1.0.0 pkgconfig_2.0.3
[49] zeallot_0.1.0 MASS_7.3-51.4 tidytree_0.2.9 prettyunits_1.0.2 assertthat_0.2.1 rstudioapi_0.10
[55] R6_2.4.0 nlme_3.1-141 compiler_3.6.1

read_meme fails when alphabet is DNA/RNA/AA-LIKE or custom

dreme.txt

> read_meme("dreme.txt")
Error in strsplit(alph, "\\s+")[[1]] : subscript out of bounds

This happens when the ALPHABET string defines a DNA/RNA/AA-LIKE or custom alphabet.

A few outcomes could happen here:

  1. Throw informative error for both *-LIKE alphabets and custom alphabets
  2. use *-LIKE Alphabets as DNA/RNA/AA
  3. throw an error for custom alphabets
  4. fully support custom alphabets

4 is quite the undertaking, so I've implemented 2 & 3. PR incoming.

Error with read_meme

Hi!

First of all, congratulations for the package. It is really useful.

After run_meme (MEME v. 5.05, 2400 sequences, 7000 background sequences, and trying with either objfun de and se), I always get the same error:

Error in universalmotif_cpp(name = x, type = "PPM", altname = x2, nsites = y[1], :
'bkg' vector is too short

MEME runs fine, the issue seems to be with read_meme. Any ideas? Is there something I may be doing wrong?

EDIT: used run_meme with classic mode, only sequences with no control, and get the same mistake.
The devel version also gives the same error.

I attach sessioninfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8     LC_MONETARY=es_ES.UTF-8   
 [6] LC_MESSAGES=es_ES.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] universalmotif_1.3.65 GrpString_0.3.2       Biostrings_2.53.2     XVector_0.25.0        IRanges_2.19.16       S4Vectors_0.23.25     BiocGenerics_0.31.6  

loaded via a namespace (and not attached):
 [1] treeio_1.9.3       gtools_3.8.1       tidyselect_0.2.5   purrr_0.3.2        lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.0        htmltools_0.4.0   
 [9] yaml_2.2.0         rlang_0.4.0        later_1.0.0        pillar_1.4.2       glue_1.3.1         lifecycle_0.1.0    rvcheck_0.1.5      plyr_1.8.4        
[17] stringr_1.4.0      ggseqlogo_0.1      zlibbioc_1.31.0    munsell_0.5.0      gtable_0.3.0       ps_1.3.0           httpuv_1.5.2       gbRd_0.4-11       
[25] crosstalk_1.0.0    Rcpp_1.0.2         xtable_1.8-4       promises_1.1.0     scales_1.0.0       backports_1.1.5    BiocManager_1.30.7 jsonlite_1.6      
[33] mime_0.7           ggplot2_3.2.1      digest_0.6.21      stringi_1.4.3      processx_3.4.1     dplyr_0.8.3        shiny_1.3.2        grid_3.6.1        
[41] bibtex_0.4.2       ggtree_1.99.1      Rdpack_0.11-0      tools_3.6.1        magrittr_1.5       lazyeval_0.2.2     tibble_2.1.3       crayon_1.3.4      
[49] ape_5.3            tidyr_1.0.0        pkgconfig_2.0.3    zeallot_0.1.0      tidytree_0.2.8     assertthat_0.2.1   rstudioapi_0.10    R6_2.4.0          
[57] nlme_3.1-141       compiler_3.6.1 

Thanks a lot :)

importing pwd by read_cisbp

Hi,

Thank you for write this package!
I am having troubles importing pwms data from cis_bp database. I already tried with both, the Bioconductor and the Github version.
What I am looking for a way to obtain the concensus sequences of all the downloaded pwms files from cis_bp database. One example:

read_cisbp("M11070_2.00.txt")

I am gettting the following warning:

Error in mapply(function(x, y) raw_lines[x:y], meta_starts, meta_stops, :
zero-length inputs cannot be mixed with those of non-zero length

How M11070_2.00.txt file looks?

Pos A C G T
1 0.304347826086957 0.0869565217391304 0.565217391304348 0.0434782608695652
2 0.0 0.0 0.0 1.0
3 1.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0
5 0.0 1.0 0.0 0.0
6 0.0434782608695652 0.565217391304348 0.326086956521739 0.0652173913043478
7 0.0 0.0 1.0 0.0
8 0.108695652173913 0.173913043478261 0.0 0.717391304347826
9 0.0888888888888889 0.177777777777778 0.0 0.733333333333333
10 0.0666666666666667 0.0444444444444444 0.0 0.888888888888889
11 0.146341463414634 0.170731707317073 0.024390243902439 0.658536585365854
12 0.292682926829268 0.170731707317073 0.024390243902439 0.51219512195122
13 0.275 0.1 0.075 0.55

Could you please help me to find the issue here?
thank you!

validObject_universalmotif(motifs) fails when strand is "*"

Hi All,

I've just run across this error where the stand information must not be "", however, convert_motifs initially generated the "" when converting a matrix to a PFMatrix, see example below.

R> tmp <- matrix(1:20, nrow = 4, dimnames = list(c("A", "C", "T", "G")))
R> tmp
  [,1] [,2] [,3] [,4] [,5]
A    1    5    9   13   17
C    2    6   10   14   18
T    3    7   11   15   19
G    4    8   12   16   20

R> convert_motifs(tmp, "TFBSTools-PFMatrix")
An object of class PFMatrix
ID: 
Name: motif
Matrix Class: Unknown
strand: *
Tags: 
list()
Background: 
   A    C    G    T 
0.25 0.25 0.25 0.25 
Matrix: 
  N N N N N
A 1 2 2 2 2
C 2 2 2 2 2
G 4 3 3 3 3
T 3 3 3 3 3

R> convert_motifs(convert_motifs(tmp, "TFBSTools-PFMatrix"))
Error in validObject_universalmotif(motifs) : 
* strand must be one of +, -, +-

Thanks for the help

-Dave

Error in convert_motifs for TFBStools-PFMatrixList

I am trying to query jaspar for all the Homo Sapiens TFs, but when I try to convert the PFMatrixList but I get the following error.

Error in if (mot_classes == classin) return(motifs) :
missing value where TRUE/FALSE needed

library(JASPAR2018)
library(TFBSTools)
library(universalmotif)

opts <- list()

opts[["species"]] <- 9606

opts[["all_versions"]] <- TRUE

PFMatrixList.jaspar <- getMatrixSet(JASPAR2018, opts)

test <- convert_motifs(PFMatrixList.jaspar)

Option to deprotect `motif` column in universalmotif_df

I was just messing around with a set of motifs and I wanted to selectively RC one while using a universalmotif_df.

df <- to_df(c(create_motif, create_motif(name = "motif2")))

df %>%
   mutate(motif = ifelse(name == "motif2", motif_rc(motif), motif))

Of course, this doesn't work because of the AsIs class.

So this got me thinking if there could be a way for folks to deprotect the motif for cases like this. Users doing this would know the operation is unsafe. Here's the silly implementation I came up with:

edit_motif <- function(m){
  class(m) <- NULL
  m 
}

So the above becomes:

df <- to_df(c(create_motif, create_motif(name = "motif2")))

df %>%
   mutate(motif = ifelse(name == "motif2", motif_rc(edit_motif(motif)), edit_motif(motif)))

Not sure if this is the perfect solution. For instance, this won't keep the AsIs attribute after the mutate, so maybe a macro for wrapping the whole ifelse operation to deprotect could work? I'll do some more thinking on this, but figured I'd post it now.

read_meme() yields out that "alphabet type cannot be detected" although "ALPHABET= ACGT" is provided

Hi there,

I'm new to this package.

When I tried to import motif from MEME, I got error information that

Error in get_custom_meme_alph(raw_lines) :
Alphabet type cannot be detected, custom alphabets are not currently supported

However my meme motif is in correct format where

ALPHABET= ACGT

My code is

motif_1 <- read_meme(system.file("/Users/Desktop/motif/pssm/up150_motif_1_pssm.txt",
                                           package = "universalmotif"))

When I tried with this code below

motif_1 <- read_meme("/Users/Desktop/motif/pssm/up150_motif_1_pssm.txt")

It returned that

Warning message:
In readLines(con <- file(file)) :
incomplete final line found on '/Users/zhangruixuan/Desktop/medusavirus_paper/motif/pssm/up150_motif_1_pssm.txt'

I'm not sure whether this warning message meaning, the last part of motif format is letter-probability matrix and I'm sure it is complete. Why did this warning come out?

Could you give me some advices? Thank you in advance.

shuffle_sequence truncated

Hi, I am trying to use the shuffle_sequence function from the universalmotif package in order to shuffle a data containing ~19300 fasta sequeunces of 50 nt length. When I run my script using jus 10 sequences it works perfectly, but when I tried to used with the big data I get this error message :
fastaFile = readDNAStringSet("m6A_5-3-sequences.fasta")
euler <- shuffle_sequences(fastaFile, k = 2, method = "euler")

Error in edgematrix[i, alph.i[lastlets[[i]]]] :
incorrect number of dimensions

error in install

BiocManager::install("universalmotif")

Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.2 (2020-06-22) Installing package(s) 'universalmotif' trying URL 'https://bioconductor.org/packages/3.12/bioc/src/contrib/universalmotif_1.8.3.tar.gz' Content type 'application/x-gzip' length 3732415 bytes (3.6 MB) ================================================== downloaded 3.6 MB  * installing *source* package ‘universalmotif’ ... ** using staged installation ** libs g++ -std=gnu++11 -I"/usr/local/lib64/R/include" -DNDEBUG  -I'/home/zhou/Rlib/Rcpp/include' -I'/home/zhou/Rlib/RcppThread/include' -I/usr/local/include   -fpic  -g -O2  -c RcppExports.cpp -o RcppExports.o g++ -std=gnu++11 -I"/usr/local/lib64/R/include" -DNDEBUG  -I'/home/zhou/Rlib/Rcpp/include' -I'/home/zhou/Rlib/RcppThread/include' -I/usr/local/include   -fpic  -g -O2  -c add_multifreq.cpp -o add_multifreq.o g++ -std=gnu++11 -I"/usr/local/lib64/R/include" -DNDEBUG  -I'/home/zhou/Rlib/Rcpp/include' -I'/home/zhou/Rlib/RcppThread/include' -I/usr/local/include   -fpic  -g -O2  -c compare_motifs.cpp -o compare_motifs.o In file included from /home/zhou/Rlib/RcppThread/include/RcppThread.h:11:0,                  from compare_motifs.cpp:2: /home/zhou/Rlib/RcppThread/include/RcppThread/Thread.hpp: In lambda function: /home/zhou/Rlib/RcppThread/include/RcppThread/Thread.hpp:42:19: error: parameter packs not expanded with ‘...’:                  f(args...);                    ^ /home/zhou/Rlib/RcppThread/include/RcppThread/Thread.hpp:42:19: note:         ‘args’ /home/zhou/Rlib/RcppThread/include/RcppThread/Thread.hpp:42:23: error: expansion pattern ‘args’ contains no argument packs                  f(args...);                        ^ In file included from /home/zhou/Rlib/RcppThread/include/RcppThread.h:13:0,                  from compare_motifs.cpp:2: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In member function ‘void RcppThread::ThreadPool::push(F&&, Args&& ...)’: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:31: error: expected ‘,’ before ‘...’ token          jobs_.emplace([f, args...] { f(args...); });                                ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:31: error: expected identifier before ‘...’ token /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:34: error: parameter packs not expanded with ‘...’:          jobs_.emplace([f, args...] { f(args...); });                                   ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:34: note:         ‘args’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In lambda function: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:44: error: expansion pattern ‘args’ contains no argument packs          jobs_.emplace([f, args...] { f(args...); });                                             ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In member function ‘std::future<decltype (f(args ...))> RcppThread::ThreadPool::pushReturn(F&&, Args&& ...)’: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:146:54: error: expected ‘,’ before ‘...’ token      auto job = std::make_shared<jobPackage>([&f, args...] {                                                       ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:146:54: error: expected identifier before ‘...’ token /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:146:57: error: parameter packs not expanded with ‘...’:      auto job = std::make_shared<jobPackage>([&f, args...] {                                                          ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:146:57: note:         ‘args’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In lambda function: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:147:22: error: expansion pattern ‘args’ contains no argument packs          return f(args...);                       ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In instantiation of ‘struct RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_cpp(const List&, const std::vector<int>&, const std::vector<int>&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda13&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]::__lambda5’: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:9:   required from ‘void RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_cpp(const List&, const std::vector<int>&, const std::vector<int>&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda13&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:201:9:   required from ‘void RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_cpp(const List&, const std::vector<int>&, const std::vector<int>&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda13&; ptrdiff_t = long int; size_t = long unsigned int]’ /home/zhou/Rlib/RcppThread/include/RcppThread/parallelFor.hpp:48:5:   required from ‘void RcppThread::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t, size_t) [with F = compare_motifs_cpp(const List&, const std::vector<int>&, const std::vector<int>&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda13; ptrdiff_t = long int; size_t = long unsigned int]’ compare_motifs.cpp:1426:18:   required from here /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:27: error: using invalid field ‘RcppThread::ThreadPool::push(F&&, Args&& ...)::__lambda5::__args’          jobs_.emplace([f, args...] { f(args...); });                            ^ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In instantiation of ‘struct RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_all_cpp(const List&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda14&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]::__lambda5’: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:9:   required from ‘void RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_all_cpp(const List&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda14&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:201:9:   required from ‘void RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = compare_motifs_all_cpp(const List&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda14&; ptrdiff_t = long int; size_t = long unsigned int]’ /home/zhou/Rlib/RcppThread/include/RcppThread/parallelFor.hpp:48:5:   required from ‘void RcppThread::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t, size_t) [with F = compare_motifs_all_cpp(const List&, const string&, double, bool, std::vector<std::vector<double> >&, int, bool, double, bool, int, double, const std::vector<double>&, const string&)::__lambda14; ptrdiff_t = long int; size_t = long unsigned int]’ compare_motifs.cpp:1485:18:   required from here /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:27: error: using invalid field ‘RcppThread::ThreadPool::push(F&&, Args&& ...)::__lambda5::__args’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp: In instantiation of ‘struct RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = pval_extractor(const std::vector<int>&, const std::vector<double>&, const std::vector<int>&, const std::vector<int>&, const string&, const std::vector<int>&, const std::vector<int>&, const std::vector<double>&, const std::vector<double>&, const std::vector<std::basic_string<char> >&, int)::__lambda15&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]::__lambda5’: /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:9:   required from ‘void RcppThread::ThreadPool::push(F&&, Args&& ...) [with F = RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = pval_extractor(const std::vector<int>&, const std::vector<double>&, const std::vector<int>&, const std::vector<int>&, const string&, const std::vector<int>&, const std::vector<int>&, const std::vector<double>&, const std::vector<double>&, const std::vector<std::basic_string<char> >&, int)::__lambda15&; ptrdiff_t = long int; size_t = long unsigned int]::__lambda8&; Args = {const RcppThread::Batch&}]’ /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:201:9:   required from ‘void RcppThread::ThreadPool::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t) [with F = pval_extractor(const std::vector<int>&, const std::vector<double>&, const std::vector<int>&, const std::vector<int>&, const string&, const std::vector<int>&, const std::vector<int>&, const std::vector<double>&, const std::vector<double>&, const std::vector<std::basic_string<char> >&, int)::__lambda15&; ptrdiff_t = long int; size_t = long unsigned int]’ /home/zhou/Rlib/RcppThread/include/RcppThread/parallelFor.hpp:48:5:   required from ‘void RcppThread::parallelFor(ptrdiff_t, ptrdiff_t, F&&, size_t, size_t) [with F = pval_extractor(const std::vector<int>&, const std::vector<double>&, const std::vector<int>&, const std::vector<int>&, const string&, const std::vector<int>&, const std::vector<int>&, const std::vector<double>&, const std::vector<double>&, const std::vector<std::basic_string<char> >&, int)::__lambda15; ptrdiff_t = long int; size_t = long unsigned int]’ compare_motifs.cpp:1859:18:   required from here /home/zhou/Rlib/RcppThread/include/RcppThread/ThreadPool.hpp:129:27: error: using invalid field ‘RcppThread::ThreadPool::push(F&&, Args&& ...)::__lambda5::__args’ make: *** [compare_motifs.o] Error 1 ERROR: compilation failed for package ‘universalmotif’ * removing ‘/home/zhou/Rlib/universalmotif’  The downloaded source packages are in 	‘/tmp/RtmpS1nRO3/downloaded_packages’ Installation path not writeable, unable to update packages: boot, class, cluster,   codetools, foreign, KernSmooth, MASS, Matrix, mgcv, nlme, nnet, spatial, survival Warning message: In install.packages(...) :   installation of package ‘universalmotif’ had non-zero exit status
--

Cannot get scan_sequences to report p-values

Hi,
Thanks for developing the package! But I'm having a problem with getting scan_sequences to calculate p-values.

I'm trying to run scan_sequences with the motifs argument as a list of PWMatrixList objects, and sequences is a DNAStringSet.
Its runs ok with calc.pvals = FALSE but when I set it to TRUE I get the following error message:

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘convert_motifs’ for signature ‘"NULL"’

The code is:
UM_scan_res_test <- scan_sequences(motifs=test_PWM,sequences = test_seq,threshold=0, threshold.type = "logodds", RC=TRUE,calc.pvals = TRUE,nthreads = 22)

I have tried using convert_motifs to convert the list of PWMatrixLists into a list of universalmotif objects and still get the same error.

I have tried to work around the issue by calculating p-values separately using motif_pvalue and can partially manage this with my data set. Unfortunately, there is a complicating secondary issue in that the results of scan_sequences (a DFrame) do not incorporate the motif ID (only '@ListData$motif' which corresponds to '@ListData$@name' of my PWMatrixList). It would be really helpful if it also outputs the ID slot (and this is output using as.data.table) which I think will always be unique. Where this all causes me headaches is the JASPAR database SPLICE collection where the motif names are not unique but the ID slot is. This means that I can output the results of the scan_sequences function (with scores), then use the scores as input to the motif_pvalue but I cannot supply the appropriate PWM in this particular case (it could be any of 6 because they are all named the same!).

I hope this makes sense. In the short term I will just not calculate p-values using univeralmotif for the JASPAR SPLICE collection.

As another aside I have deliberately set the threhold very low as I wanted to calculate a relative threshold in the same way TFBSTools does (and compare results). This uses a percentage (quantile) which is not percentage of the maximum score but a percentage of the range (from minimum to maximum)...this is digressing somewhat from the main issue though!

R session aborted / fatal error when running read_homer from a list of motifs

Hi,
I ran homer on a list of genomic regions, and my output generated 202 motifs.
Next, I ran I try to read the output using the read_homer function, but my R session aborts after 1-3 minutes. I do not get the same problem when I run the same code but when there are only 8 motifs in the directory (rather than 202). I'm not sure if the problem has something to do with the error I get (see below) about 'treeio'.

setwd("/Volumes/BINF1_Raid/home/aspepin/projects/H3K4me3_McMasterU/exp2/")
library(universalmotif)

# Bioconductor version 3.11 (BiocManager 1.30.12),
# ?BiocManager::install for help
# Bioconductor version '3.11' is out-of-date; the
# current release version '3.14' is available with R
# version '4.1'; see https://bioconductor.org/install
# Registered S3 method overwritten by 'treeio':
#   method     from
# root.phylo ape 

#the directory below contains the output of homer i.e. 202 motifs
up.dir <- "./output/homer_vs_whole_genome/CDvsHFD_up/knownResults/" 
up.motif.list <- list.files(up.dir,
                         pattern="*.motif",
                         full.names = TRUE,
                         recursive = TRUE)
up.motifs <- lapply(up.motif.list, function(x) read_homer(x))

Whether I run the above, or if I replace the last line (with lapply function) with the following:

up.motifs <- list()
up.motifs <- lapply(1:length(up.motif.list), function(i){
  motifs <- read_homer(up.motif.list[i])
  return(motifs)
})

My R session aborts and it says "R session aborted, encountered fatal error".

Any idea how I can solve this?

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] universalmotif_1.6.4

loaded via a namespace (and not attached):
 [1] treeio_1.12.0       tinytex_0.31       
 [3] tidyselect_1.1.1    xfun_0.29          
 [5] purrr_0.3.4         lattice_0.20-41    
 [7] ggfun_0.0.5         colorspace_2.0-2   
 [9] vctrs_0.3.8         generics_0.1.1     
[11] stats4_4.0.2        yaml_2.2.1         
[13] utf8_1.2.2          gridGraphics_0.5-1 
[15] rlang_0.4.12        pillar_1.6.4       
[17] glue_1.6.0          DBI_1.1.2          
[19] BiocGenerics_0.34.0 rvcheck_0.1.8      
[21] lifecycle_1.0.1     ggseqlogo_0.1      
[23] zlibbioc_1.34.0     Biostrings_2.56.0  
[25] munsell_0.5.0       gtable_0.3.0       
[27] IRanges_2.22.2      ps_1.6.0           
[29] parallel_4.0.2      fansi_0.5.0        
[31] Rcpp_1.0.7          scales_1.1.1       
[33] BiocManager_1.30.12 S4Vectors_0.26.1   
[35] jsonlite_1.7.2      XVector_0.28.0     
[37] ggplot2_3.3.5       aplot_0.1.2        
[39] processx_3.5.2      dplyr_1.0.7        
[41] rbibutils_2.2.7     grid_4.0.2         
[43] ggtree_2.2.4        Rdpack_2.1.4       
[45] tools_4.0.2         yulab.utils_0.0.4  
[47] magrittr_2.0.1      lazyeval_0.2.2     
[49] patchwork_1.1.1     tibble_3.1.6       
[51] crayon_1.4.2        ape_5.6-2          
[53] tidyr_1.1.4         pkgconfig_2.0.3    
[55] MASS_7.3-53.1       ellipsis_0.3.2     
[57] tidytree_0.3.9      ggplotify_0.0.6    
[59] assertthat_0.2.1    rstudioapi_0.13    
[61] R6_2.5.1            nlme_3.1-152       
[63] compiler_4.0.2 

Can't run meme with custom alphabet

Hello!
I have the issue that I can't run the run_meme function with custom alphabet. The first traceable error is in 'get_custom_meme_alph', where the alphabet file is read. It searches for a match to "END ALPHABET" which can't exist, if the alphabet is to pass other checks. I have made a small reproducible example with Biostrings version 2.64 and universalmotif version 1.13.9 where I create the custom methylated DNA alphabet you have presented in the vignettes and write it to a file then try to use it (https://bioconductor.org/packages/release/bioc/manuals/universalmotif/man/universalmotif.pdf)

library(universalmotif)
library(Biostrings)

meme_alph("ACm", complements = "TGM", alph.name = "MethDNA",
          letter.names = c(A = "Adenine", C = "Cytosine", G = "Guanine",
                           T = "Thymine", m = "Methylcytosine", M = "mC:Guanine"),
          like = "DNA", ambiguity = c(N = "ACGTmM"),
          file = "temp.alph")

sequences <- DNAStringSet(c("ccaagtttcaa", "taaagttgtat", "aaatgccctac", "tccaagtttca", "gtaccttgtga", "ttaaaatcgtc", "gaacactttct", "aatcacttttg", "ttgtcctggcg"))
# Runs great with standard DNA alphabet
first_attempt <- run_meme(sequences, bin = "/path/to/bin", verbose = 0)
# Doesn't run
secont_attempt <- run_meme(sequences, bin = "/path/to/bin", verbose = 0, alph = "temp.alph")

Improve support for metadata-based manipulations

This may be beyond the purview of universalmotif, but I wanted to float this
idea with you because, to me, universalmotif has done a beautiful job of
standardizing the process of manipulating motifs in R and I would like for it to
grow to be come the de-facto standard data structure for these tasks. I really
love this package.

With this in mind, one area I feel could be improved is how motifs are
manipulated and filtered in bulk. filter_motifs does a good job of filtering
on specific values, but an aspect I have struggled with in the past is the
process of tidying motifs and their metadata. In particular because I have to
write lots of boilerplate each time to do so.

For example, certain entries in motifDb have out of date information, because
the annotations it pulls from are not always well maintained (this is
particularly true for model organisms like Drosophila). In these instances,
it's necessary to make changes like renaming motifs, or join metadata from
disparate sources into 1 object.

For example, say I have a set of motifs which are named: "geneA_FBgn1000",
"geneB_FBgn2000", and I would like the universalmotif name slot to hold:
"geneA" and "geneB", while altname holds "FBgn1000", "FBgn2000". I need to
loop over each universalmotif object in the list, extract the name slot (using
@ is the only way I know to do this), split on "_", then directly reassign
each slot, again using @. At the very least, I think there should be some
setters/getters for each slot so this can be vectorized.

Extending this idea somewhat, say I now want to do this for data in the
extra_info slot, there's more boilerplate to deal with the assignment, and if
I'm thinking about this correctly, the potential for serious slowdowns because
the list will have to be reallocated to grow in size.

In other words, if universalmotif objects could become as easy to manipulate
as data.frames, this could greatly improve the workflows of motif-centric
analysis.

To explore these ideas, I created a prototype object using data.frames and
universalmotif together see here. In my
experiments, this structure has greatly improved the process of analyzing motif
databases themselves and the process of data cleaning using both base R and
tidyverse approaches. The idea is essentially to allow data.frames to convert
seamlessly back and forth between universalmotif and data.frame format.
In this way, all tools for data.frame manipulation are enabled for free in a way
that is compatible with universalmotif objects. Ideally, this would also allow
new columns to be stored as extra_info, but I haven't implemented anything
like that so far.

I was wondering if something like this may be of interest to formally port to
universalmotif. Perhaps the object could be extended to enable different
views similar to tidygraph, where in one form, the data could be viewed as a
data.frame, or alternatively, in the current universalmotif format.

If you think something like this is worthwhile, I would be happy to help
implement it.

altname is dropped during summarise_motif

although as.data.frame behaves correctly, summarise_motifs always drops the 'altname' column.

m <- create_motif()

m["altname"] <- "myAlt"
m["organism"] <- "drosophila"
m["family"] <- "ETS"

df <- universalmotif::as.data.frame(m)
summary <- universalmotif::summarise_motifs(m, na.rm = F)
df$altname == summary$altname

This is caused by a typo in summarise_motifs_cpp.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.