This may be beyond the purview of universalmotif, but I wanted to float this idea

This looks great so far. The one downside is the "AsIs" thing with <code class="notran

Improve support for metadata-based manipulations,about bjmt/universalmotif

Comments (22)

bjmt commented on July 17, 2024 1

I've set up a new tidy-motifs branch with some experimental code (in R/universalmotif_df.R). There are three new functions:

to_df(): generate the data.frame
to_list(): update motifs and return to a list
update_motifs(): update motifs and stay as a data.frame

Have yet to implement manipulation of the extrainfo and bkg slots so far.

Let me know what you think if you get a chance to look at it. Thanks again for the inspiration.

from universalmotif.

bjmt commented on July 17, 2024 1

Awesome. The code has been merged into master. I'll add some examples to the vignettes over time (and maybe some tests), but I think the current implementation should be fairly stable for a while. Ideas/contributions would be greatly appreciated of course, you seem to have better ideas than I when it comes to using tidyverse functions.

Feel free to close if you feel content with how it's working, otherwise let me know if you come across any problems.

from universalmotif.

bjmt commented on July 17, 2024

Thanks for the input. I'm glad you like the package and have found it useful.

Frankly, this is something I agree with you on. For some time now I have had on my to do list an idea to implement a singular compact data structure for motifs (somewhat like MotifDb) to make manipulation of multiple motifs at once much easier. Unfortunately it has simply never been a high priority goal for me relative to other things. However, I would be happy to work on it now if you are interested in helping towards this goal.

I've looked at the implementation you linked, and it certainly looks quite elegant. One thing I like about using regular lists to store motifs (as messy as it can be to work with at times) is that it is dead simple, and doesn't require that people read documentation to be able to interact with them (most won't). I'm saying this because I feel the way you've implemented it actually circumvents this issue nicely: it requires performing an additional action to get the combined object (as_universalmotif_dataframe), at which point I think it's safe to assume the user has read some of the documentation and know what they're doing (though maybe you disagree with this point and believe the combined object should be the default representation regardless). So I think if it were to be implemented in this package it could follow a similar idea. (Your idea about new columns being stored in the extrainfo slot is also interesting --- so one could just add another column with $ and it would automatically be added as an additional item in the extrainfo slot.)

One thing I'm unsure of: whether to create a new S4 representation for multiple motifs (like MotifDb) or simply do as you have done, with the actual motifs preserved in one of the columns of a plain data frame. Using an S4 object would allow me to provide safety checks for any modifications being performed, but could also mean it not being useable for functions which rely on method dispatch to interact with data frames. (Though I guess one way to save users from performing incorrect modifications of motif metadata in plain data frames would be to keep the original motifs in another column as you have done, and allow for the incorrect modifications to be reset based on the unmodified motif column once they are alerted of the error when either trying to update the motif column or converting back to a regular list. This would just mean someone wouldn't know of their mistake until later. Though after writing this out it is starting to sound a little messy, but maybe it'd be worth it so tidyr-type interactions as you've done are possible.)

I will think about some more and do some experimentation (I will poke around some of the 'tidy' Bioconductor packages and maybe get some inspiration to help with better integrating S4 with the tidy ecosystem). Feel free to chime in if you have differing opinions to what I've said or suggestions. I'm happy to hear that you are willing to help me with this, but I think I would like to implement the basic combined data structure first. I will update the issue once I've got something working.

PS: You can actually get/set the name/altname slots (or any slot) safely this way:

motif["name"]
motif["name"] <- "new name"
motif["altname"]
motif["altname"] <- "new altname"

Not that I blame you for not knowing. This is hidden in the example section of the universalmotif-class man page, and section 2.1 of the motif manipulation vignette. I've stumbled across examples of code using this package a few times out in the wild, and most do as you did with the @. Better communicating this somehow is also on my to do list... (Also just better documentation in general honestly.)

from universalmotif.

snystrom commented on July 17, 2024

I've looked at the implementation you linked, and it certainly looks quite elegant. One thing I like about using regular lists to store motifs (as messy as it can be to work with at times) is that it is dead simple, and doesn't require that people read documentation to be able to interact with them (most won't). I'm saying this because I feel the way you've implemented it actually circumvents this issue nicely: it requires performing an additional action to get the combined object (as_universalmotif_dataframe), at which point I think it's safe to assume the user has read some of the documentation and know what they're doing (though maybe you disagree with this point and believe the combined object should be the default representation regardless). So I think if it were to be implemented in this package it could follow a similar idea. (Your idea about new columns being stored in the extrainfo slot is also interesting --- so one could just add another column with $ and it would automatically be added as an additional item in the extrainfo slot.)

no, I think the universalmotif object should remain the default presentation. I agree wholeheartedly about the list structure. It's a nice balance between the safety of S4 and base R syntax, and is usually sufficient for most tasks.
Also agree about the reading the docs bit. This smells like a more advanced feature, especially since it will have some safety tradeoffs relative to S4, even if they are eventually checked.

One thing I'm unsure of: whether to create a new S4 representation for multiple
motifs (like MotifDb) or simply do as you have done, with the actual motifs
preserved in one of the columns of a plain data frame. Using an S4 object would
allow me to provide safety checks for any modifications being performed, but
could also mean it not being useable for functions which rely on method dispatch
to interact with data frames. (Though I guess one way to save users from
performing incorrect modifications of motif metadata in plain data frames would
be to keep the original motifs in another column as you have done, and allow for
the incorrect modifications to be reset based on the unmodified motif column
once they are alerted of the error when either trying to update the motif column
or converting back to a regular list. This would just mean someone wouldn't know
of their mistake until later. Though after writing this out it is starting to
sound a little messy, but maybe it'd be worth it so tidyr-type interactions as
you've done are possible.)

This issue is the biggest one I've run up against with my implementation. On one hand, I am very comfortable with the flexibility, but I do worry a bit about safety or just confusing behavior for end-users. For example, in my structure currently, the linked motif column is not updated in sync with the data.frame, so if users pass df$motif to view_motifs without calling update_motifs or as_universalmotif, they'll see the old motif. Without re-exporting tidy functions, I don't see how an automated update could be implemented if we stuck with the data.frame route, but I am very much not in favor of reexporting tidy methods, especially since many conflict with AnnotationHub & motifdb functions, it would cause some annoying namespace issues for users, I think. Of course, I could be convinced otherwise.
something else to consider is: sometimes delaying the safety check may be a good thing. For instance, what if there's a weird format where the name encodes a character and a meaningful integer. A user may want to split the integer out in the data.frame format, of course it would stay a character before casting to double or int to fit in, for instance the E-value slot. If the safety check prevented directly splitting name into c("name", "evalue") because of a type mismatch, you'd have to do two steps and drop the intermediate evalue_char column. Not a deal breaker though.

I will think about some more and do some experimentation (I will poke around
some of the 'tidy' Bioconductor packages and maybe get some inspiration to help
with better integrating S4 with the tidy ecosystem). ... I think I would like to implement
the basic combined data structure first. I will update the issue once I've got
something working.

Sounds great, looking forward to seeing what you come up with.

PS: You can actually get/set the name/altname slots (or any slot) safely this way:

Ah, I knew this. Silly me.

just better documentation in general honestly.

Overall, the docs are amazing. Literally when I want to teach someone things about motifs, I just point them straight there. Tons on knowledge in one place, which really can't be understated. Maybe a "quickstart" section would be sufficient.

from universalmotif.

snystrom commented on July 17, 2024

Another thought I just had re: delayed safety checks. If we implemented really good error messages for converting back to universalmotif format, these issues may resolve themselves. For instance, giving a preview of the invalid values, and suggest what data type they need to be and what slot they're trying to get assigned to. This could also help with the "most folks won't read the docs" thing.

from universalmotif.

snystrom commented on July 17, 2024

This looks great so far. The one downside is the "AsIs" thing with df$motif not working. Wonder if there's a way to improve that.

As far as extrainfo goes, perhaps there should be an optional argument to to_list() and update_motifs(), like keep_extra or something, set to FALSE by default. This way there's a simple way to keep the data, or purge it if the new columns are just temporary.

Found a teeny bug in to_list(), PR incoming.

from universalmotif.

snystrom commented on July 17, 2024

Related to AsIs, thinking more about this, throwing an error is a good way to ensure users call update_motifs() before using them, ensuring the motif is not out of date. Having an extra method on AsIs to update wouldn't work without some environment magic to grab the original df, so that's probably the best it can be if this is the solution.

Something to consider though, is that this data.frame structure would open up new ways of storing relationships between more than one motif. For example, if discovering a set of de-novo motifs, then comparing to a motif database, you could imagine storing another universalmotif column with the match (not saying this needs to get added to the package, just that users could do this on their own). In this case, you could imagine pivoting the df to get columns of each de-novo + match pair, or some other manipulation. In these instances, having the ability to directly call columns storing universalmotif objects would be rather powerful, enabling complex analysis of motif sets.

I admit I have skin in the game on this one, since I use this approach in a package I wrote, but if you think this is beyond what you want, I think that's totally reasonable. At some point, those unusual structures will no longer fit perfectly into this framework, but perhaps anchoring on a column named motif is sufficient to avoid most of those problems.

from universalmotif.

bjmt commented on July 17, 2024

Thanks for the PR. I've added a method in convert_motifs() which throws a more informative error message when trying to use the column directly (pretty sure just about every function in the package calls convert_motifs() first thing).

Thanks also for the extrainfo idea. I'll think a bit more about it.

I'll admit I'm curious about your idea for more complex inter-motif interactions, but I'm not sure I fully understand it. Do you mind typing out some pseudocode of what you think it could work like so I can get a better idea? I certainly have always wanted to improve the compare_motifs() --> merge_motifs() process for example, and maybe what your suggesting could help with that.

from universalmotif.

snystrom commented on July 17, 2024

Sorry for the late reply, got busy.

To explain a bit more about the scenario above, it is maybe more convoluted than it should be, but was the first example I thought of. But the idea is generally to be able to use the motif column in downstream operations on the universalmotif_df. Consequently, when I tried this using the current implementation, it worked great. So maybe the "AsIs" thing is less of an issue than I thought. That being said, the idea was say I have a set of motifs:

library(magrittr)
library(universalmotif)
myMotifs <- c(universalmotif::create_motif("CCRAAAW", name = "motif_1"),
              universalmotif::create_motif("TTAATT", name = "motif_2"))

I want to compare these to a set of known motifs, identify a "best match", then evaluate visually whether they're a good match. I'll use the flyFactor survey data from motifDb.

flyFactorDb <- MotifDb::MotifDb %>%
  MotifDb::query("FlyFactorSurvey") %>% 
  convert_motifs()
#> See system.file("LICENSE", package="MotifDb") for use restrictions.

How I do the comparison doesn't really matter. For brevity, I've abstracted this away to a dummy function: match_motifs(). But the idea is that this function takes some motifs & spits out which motif from the database is the best match to the input. You could imagine others make features like this to return universalmotif_dfs, or append columns to an input universalmotif_df (I use this approach in my MEME Suite wrapper package). Here I used a non-universalmotif_df for brevity.

# match_motifs is a hypothetical function returning some information about a match & a universalmotif of the top hit.
(myMatches <- match_motifs(myMotifs, flyFactorDb))
#>      name                                                     best_match_motif
#> 1 motif_1 <S4 class 'universalmotif' [package "universalmotif"] with 20 slots>
#> 2 motif_2 <S4 class 'universalmotif' [package "universalmotif"] with 20 slots>

To associate the match with the input metadata, do a join:

# join this data back to universalmotif_df
motif_df <- myMotifs %>% 
  to_df() %>% 
  dplyr::left_join(myMatches, by = "name")

Then finally, I build a new column for the pair of motifs (the input & the match) and visualize:

# now to visualize the input:match pair, merge the two, then plot
motif_df %>% 
  dplyr::mutate(motif_and_match = purrr::map2(motif, best_match_motif, c)) %>% 
  {purrr::map(.$motif_and_match, view_motifs)}
#> [[1]]

I guess all this is to say, I thought "AsIs" would stop the motif column from playing nice with tidy functions, but it doesn't, so I think it's good as is (pun intended).

from universalmotif.

bjmt commented on July 17, 2024

No worries, there's no rush.

Wow -- you are certainly creative. (Or maybe I'm just getting out of touch with the tidyverse..) I like the idea of getting multiple columns of motifs and applying transformations/etc to them simultaneously by row. I'm gonna think more about this. Though I suspect in the end I may just decide to provide have the basic universalmotif_df format alongside example workflows (doing things like in your example) instead of creating new functions/retrofitting old ones. (Unless you can think of any function which flat out doesn't work with the format, in which case I don't mind patching such cases.)

This has made me think though: maybe I shouldn't be throwing an error (via detection of the AsIs attribute) when the motif column is used directly in functions, and just risk people forgetting to update. I'll have to play around with things a bit more though, since somehow something neat like this doesn't seem to cause any problems:

m1 <- create_motif(name = "A")
m2 <- create_motif(name = "B")
m3 <- create_motif(name = "C")
m4 <- create_motif(name = "D")

df <- data.frame(V1 = I(c(m1, m2)), V2 = I(c(m3, m4)))
df
#>        V1      V2
#> 1 <mot:A> <mot:C>
#> 2 <mot:B> <mot:D>

apply(df, 1, merge_motifs)
# No error!

But otherwise, let me know if you think anything about the current implementation could be changed to make some of the things you had in mind work. Again I'm very glad for all of your input so far.

from universalmotif.

snystrom commented on July 17, 2024

I agree, I think this structure works well, and I can't think off the top of my head of any additional functions you'd need to add for this to work well. In both of our examples, these operations just work.

Related to the error throwing though, I think safety is a good thing, so maybe just make the error really informative. Error text like "direct calls to the motif column of a motif_df may result in out of date data. Please call to_list() on your data first as in: to_list(your_df)"

Although maybe this should be a warning (that could get really annoying)? I guess I go back and forth. Is there a way to mark the motif whether it's up to date or not (like with a hash?)? That's probably more trouble than it's worth though.

All said & done, I think this is looking really nice. In my tests so far, it works quite well.

~~Maybe 1 remaining function to add is a update_motif() which will update the motif slot using the df names. This way you can continue with df operations in a pipe. For example:~~

~~motifs <- c(create_motif(name = "A"), create_motif(name = "B"))~~

motifs %>%
to_df() %>%
mutate(name = "new_name") %>%
update_motif() %>%
mutate(new_col = some_operation(motif))

from universalmotif.

snystrom commented on July 17, 2024

(Also I am an idiot, of course update_motifs() exists.)

from universalmotif.

snystrom commented on July 17, 2024

alongside example workflows (doing things like in your example) instead of creating new functions/retrofitting old ones

I agree with this. Examples of proper use go much further than just describing the feature. Happy to help write these docs. Once these make it into master , I'll transition memes away from my homebrew version to the official structure. Hoping to submit that to bioc in a few weeks. So that will be another source of applied examples. The vignettes use lots of universalmotif tools in this type of cycle of motif discovery, comparison, etc.

Longer term, you could imagine writing a Bioc workflow integrating some of the more advanced universalmotif features with other bits of the bioconductor stack. At last years bioc meeting, several folks were interested in more motif analysis workshops or examples, so I imagine it would be a rather popular topic.

from universalmotif.

bjmt commented on July 17, 2024

Oh and with regards to marking out of date motifs: one possibility would be to add that to the print.universalmotif_df method. I worry somewhat it could a rather heavy task to perform for large datasets, but I can test it out. It could certainly make things a little friendlier for interactive use.

from universalmotif.

snystrom commented on July 17, 2024

Finally getting around to porting memes over to use these. And so far things are going smoothly. However, one thing I'm running into is motif_df %>% to_list(extrainfo = TRUE) %>% to_df(extrainfo = TRUE) converts all the extrainfo columns to character. I think this is simply because the extrainfo slot stores everything as a character. Is that something that could be altered? I seem to recall part of the reason that feature exists is because of floats & rounding issues, so would be curious to hear your thoughts.

from universalmotif.

bjmt commented on July 17, 2024

The extrainfo slot was intended to just be for any extra information contained in non-universalmotif motifs that I didn't consider worthy of a specific slot for (for example, the data source information from MotifDb motifs). Using a character slot in universalmotif objects seemed like an easy catch-all solution. Eventually I also made it so that meme motif P-values were also stored there in case they were smaller than R's doubles limit after someone brought it up in an issue, and that was the easiest solution to the problem I could think of at the time (but the slot was not originally intended for that).

I suppose changing the slot to a list-type slot could make it more flexible, but I don't want to cause issues with backwards compatibility. How pressing/important do you feel the need to have the ability to hold extra numeric information? I wouldn't be too averse to adding a numeric slot (maybe extranum or something). Not sure I like the idea of more complex slots like lists or data.frames though.

from universalmotif.

snystrom commented on July 17, 2024

How pressing/important do you feel the need to have the ability to hold extra numeric information?

I think it would make the list -> df workflow much more useful if this feature existed. But I agree there should be a reasonable limit to this (no lists, matrices, or data.frames seems reasonable). The other thing to consider is what to do about factors, which I'd say store as a character with the factor name, rather than as a numeric.

I appreciate though the backwards compatibility issue & some of the more nuanced uses of the slot, like the MEME p-value issue though, so if it's not an easy fix I think it can easily wait.

from universalmotif.

snystrom commented on July 17, 2024

Also, what do you think about downgrading the extrainfo warning in to_list() to a message()? Now that I'm using these internally, I want to silently dump the extrainfo but using suppressWarnings feels like a code smell & potential to cause silent issues to happen. If the extrainfo warning was a message, suppressMessages would silence it & allow warnings to bubble up.

from universalmotif.

snystrom commented on July 17, 2024

Actually. This is a little bigger than I thought. Because this means calling update_motifs converts all columns to character, so it actually would break pipes.

For instance:

df <- create_motif("TTT", name = "motif1") %>%
   to_df() %>%
   mutate("test" = 100)

df$test == update_motifs(df, extrainfo = TRUE)$test
> FALSE

What about altering update_motifs to hold out factors, matrix, list and other unsupported datatype columns then add them back after calling to_df(m, extrainfo)? They could be joined by name since that will always be up to date.

from universalmotif.

bjmt commented on July 17, 2024

I can change the warning to a message. Yes, good point about factors -- properly coercing to characters seems more reasonable than adding a slot for factors. I don't mind adding an extranum slot, thinking about it more I think it's a good idea.

What about altering update_motifs to hold out factors, matrix, list and other unsupported datatype columns then add them back after calling to_df(m, extrainfo)? They could be joined by name since that will always be up to date.

Hmm, I could do this. Do you think this should be an optional parameter or default behaviour? The former could get kind of annoying depending on the workflow, but the latter seems more prone to mistakes.

from universalmotif.

snystrom commented on July 17, 2024

Do you think this should be an optional parameter or default behaviour?

I've got something working I'll PR in a bit that does a pretty good job. My initial guess is it'd be OK for default behavior. The trick I think is correctly merging the data back. Right now I'm doing something silly just using the values of name and consensus to merge back, but we could work something more robust out.

At the very least, if it does have an error, it would never mess up the motif matrix, or result in a huge universalmotif object getting nested into the df.

from universalmotif.

snystrom commented on July 17, 2024

Closed by #18.

from universalmotif.

Improve support for metadata-based manipulations about universalmotif HOT 22 CLOSED

Comments (22)

Related Issues (18)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent