hadley / reshape


An R package to flexibly rearrange, reshape and aggregate data

Home Page: http://had.co.nz/reshape

License: Other

R 77.93% C++ 22.07%

reshape's Introduction

reshape2

Status

reshape2 is superseded: only changes necessary to keep it on CRAN will be made. We recommend using tidyr instead.

Introduction

Reshape2 is a reboot of the reshape package. It's been over five years since the first release of reshape, and in that time I've learned a tremendous amount about R programming, and how to work with data in R. Reshape2 uses that knowledge to make a new package for reshaping data that is much more focused and much much faster.

This version improves speed at the cost of functionality, so I have renamed it to reshape2 to avoid causing problems for existing users. Based on user feedback I may reintroduce some of these features.

What's new in reshape2:

considerably faster and more memory efficient thanks to a much better underlying algorithm that uses the power and speed of subsetting to the fullest extent, in most cases only making a single copy of the data.
cast is replaced by two functions depending on the output type: dcast produces data frames, and acast produces matrices/arrays.
multidimensional margins are now possible; grand_row and grand_col have been dropped, and the name of a margin now refers to the variable whose value is set to (all).
some features have been removed such as the | cast operator, and the ability to return multiple values from an aggregation function. I'm reasonably sure both these operations are better performed by plyr.
a new cast syntax which allows you to reshape based on functions of variables (based on the same underlying syntax as plyr).
better development practices like namespaces and tests.
the function melt now names the columns of its returned data frame Var1, Var2, ..., VarN instead of X1, X2, ..., XN.
the argument variable.name of melt replaces the old argument variable_name.
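
As a rough illustration of the dcast/acast split described above, here is a minimal sketch using the built-in airquality data. It assumes reshape2 is installed; the object names (aqm, monthly_df, monthly_mat) are made up for this example.

```r
library(reshape2)

# Long format: one row per (Month, Day, variable) combination.
aqm <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# dcast returns a data frame, acast returns a matrix.
monthly_df  <- dcast(aqm, Month ~ variable, fun.aggregate = mean)
monthly_mat <- acast(aqm, Month ~ variable, fun.aggregate = mean)

class(monthly_df)
class(monthly_mat)
```

The same formula is used in both calls; only the container of the result differs.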

Initial benchmarking has shown melt to be up to 10x faster, pure reshaping cast up to 100x faster, and aggregating cast() up to 10x faster.

This work has been generously supported by BD (Becton Dickinson).


reshape's Issues

variable.name in 'melt' with data.table does not work if 'variable' column exists

If a column is already called 'variable', melt has a problem with data.table, even if a new name is indicated with variable.name:

library(data.table)
library(reshape2)
testDat = data.table('variable'=c('a','b'),'bla1'=c(1,2),'bla2'=c(3,4))
melt(testDat,id.vars='variable',variable.name='Summary')

$ Error in setnames(ans, "variable", variable.name) :
$ Some items of 'old' are duplicated (ambiguous) in column names: variable

problems with some aggregation functions (min, max, median)

I've come across some unexpected problems with reshape2. In short, some choices for fun.aggregate cause warnings or errors. Using either min or max produces a warning, though dcast still gives the right result. Using median produces an error and no results at all. The following short bit of code illustrates this point:

library(reshape2)
testdata = data.frame(type=rep(c("a","b"),10), value=1:20)
dcast(testdata, type~. , fun.aggregate=mean)
dcast(testdata, type~. , fun.aggregate=sum)
dcast(testdata, type~. , fun.aggregate=min)
dcast(testdata, type~. , fun.aggregate=max)
dcast(testdata, type~. , fun.aggregate=median)
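
One plausible trigger, offered here as an assumption rather than a confirmed diagnosis, is the aggregation function being called once on a zero-length vector. The base R sketch below shows how min and median behave on empty input; the variable names are illustrative only.

```r
empty <- integer(0)

# min() on a zero-length vector warns and returns Inf.
min_warns  <- inherits(tryCatch(min(empty), warning = function(w) w), "warning")
min_result <- suppressWarnings(min(empty))

# median() on a zero-length vector returns NA instead of a number.
median_result <- median(empty)

# mean() and sum() return NaN and 0 silently, which may be why they look fine.
mean_result <- mean(empty)
sum_result  <- sum(empty)
```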

Inconsistent behaviour in melt.list() with multi-level lists:

R> melt(list(blah=c(1,230,123), a=list(c="b", d="f")))
  value   L1   L2
1     1 blah <NA>
2   230 blah <NA>
3   123 blah <NA>
4     1    a    c
5     2    a    d
R> melt(list(blah=c(1,230,123,'s'), a=list(c="b", d="f")))
  value   L1   L2
1     1 blah <NA>
2   230 blah <NA> 
3   123 blah <NA>
4     s blah <NA>
5     b    a    c
6     f    a    d

In the first, the "value" column is numeric; in the second, it's a factor. I'm guessing it just sticks with whatever format the first list element ends up with when it's converted to a data frame? This probably isn't a major problem, because no one in their right mind is going to be using melt with lists like this, but I noticed it and thought it might be relevant in cases I hadn't thought of.

melt 1.4.0.99 and list columns

Hi Kevin, Hadley,

I had previously noted down some of the differences between melt.data.table and melt.data.frame (from 1.2.2) in ?melt.data.table under NOTE section. After updating to the recent version, I see a difference in the way list columns are handled in the latest version of melt.

require(data.table)
df <- as.data.frame(data.table(x=1:5, y=as.list(6:10), z=11:15))
> sapply(df, class)
#         x         y         z 
# "integer"    "list" "integer" 

## 1.2.2
sapply(melt(df, id=1:2), class)
#         x         y  variable     value 
# "integer"    "list"  "factor" "integer" 

## 1.4
sapply(melt(df, id=1:2), class)
# Error: Can't melt data.frames with non-atomic columns

The same error also occurs when a measure.var is of type list (which is understandable). But when id.vars is of type list, there is no reason to error.

melt from 1.2.2 was inconsistent on columns of type list in measure.vars, in that, it'll unlist that column - which basically may or may not result in error depending on the length of each element in the list.

I'm guessing this isn't deliberate, as there's no mention in the NEWS and it is also not clear from ?melt.data.frame. Is it?

Ideally, it'd be nice if ?melt.data.frame explains what is and isn't supported under the DETAILS section, which is currently empty, so that it's easier for packages that implement on top of it (and for users) to identify these relatively easily.
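
Until the documentation spells this out, a caller can check for list columns before melting. The following is a base R sketch only; the data frame mirrors the one above, and non_atomic is a name invented for this example.

```r
# Build a data frame with one list column, similar to the report above.
df <- data.frame(x = 1:5, z = 11:15)
df$y <- as.list(6:10)

# Names of columns that are not atomic vectors (here, the list column).
non_atomic <- names(df)[!vapply(df, is.atomic, logical(1))]
non_atomic
```

A package building on melt could use such a check to decide whether to drop, convert, or reject those columns up front.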

melt.data.frame OBJECT bit not set when value is POSIXct

Similar to issue #43. The OBJECT bit in the SEXP sxpinfo header is not being set when the value field is a vector with attributes, in this case, POSIXct. The effect is that print.default gets dispatched on the vector rather than the appropriate S3 method.

Associated StackOverflow Q&A is here.

Faster colsplit function

While reshape is great, I just found that in my application the colsplit function is kind of a bottleneck that makes the code quite slow. The problem is that str_split_fixed seems to be pretty slow on large vectors. I programmed an alternative version that is much quicker for large vectors that have many duplicates. Below is the function and an example that illustrates the issue in my application. In the example, the new function has more than a 100 fold speed increase.

# Version of colsplit that works much faster for large vectors
# with many duplicates
colsplit = function (string, pattern, names,split.unique = NROW(string)>100)
{
  # Original Computation: split all string
  # Problem str_split_fixed can be quite slow for long vectors
  if (!split.unique) {
    vars <- str_split_fixed(string, pattern, n = length(names))
    df <- data.frame(alply(vars, 2, type.convert, as.is = TRUE),
                     stringsAsFactors = FALSE)
    names(df) <- names

    # Only split unique strings and match afterwards
    # works much faster for long vectors with many duplicates
  } else {
    uni.string = unique(string)
    # Only have speed gains in case there are substantially less
    # unique strings than normal strings
    if (length(uni.string)>0.5*length(string))
      return(colsplit(string,pattern,names,split.unique = FALSE))
    uni.df <- colsplit(uni.string,pattern,names,split.unique=FALSE)
    rows <- match(string,uni.string)
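    df <- uni.df[rows, , drop = FALSE]
    # Reset row names so they match the output of the non-unique path
    rownames(df) <- NULL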
  }
  df
}


# An example with timing
library(reshape2)
library(stringr)
library(plyr)

T = 10000
prod = c("A","B","C")
attr = c("x","y","z")
cross.paste = function(left,right,sep="_") {
  as.character(t(outer(left,right,paste,sep=sep)))
}
df = as.data.frame(cbind(1:T,matrix(runif(T*3*3),T,3*3)))
colnames(df) = c("t",cross.paste(prod,attr))
head(df)

# Melt df
df.melt = melt(df,id.var="t")
NROW(df.melt)
head(df.melt)

# Want to separate prod and attr: use colsplit

# Original version is very slow
system.time(df.split <- reshape2::colsplit(df.melt$variable,"_",c("prod","attr")))
#user  system elapsed
#52.87    0.14   59.2

# Modified version works much quicker in this example
system.time(df.split2 <- colsplit(df.melt$variable,"_",c("prod","attr"),split.unique=TRUE))
#user  system elapsed
#0.39    0.00    0.51

# Results are the same
identical(df.split,df.split2)

# Finish the transformation to the desired format in df.final
df.work = cbind(df.split2,df.melt)
head(df.work)
df.final = dcast(data=df.work,t + prod ~ attr,value.var = "value")
head(df.final)

melt behaviour with indexed but non-unique dataframe column names

Hadley et al.,
As a regular user of the excellent reshape2 package, the following behavior of melt() surprised me:

Simple dataframe:
in.df <- data.frame(AA=c(1,2,3), BB=c(10,11,12), CC=c(13,14,15))

melt() works as expected
melt(in.df, id.vars=c(1))

But my dataframe has duplicated (non-unique) column names
names(in.df) <- c("AA","BB","BB")

This doesn't do what I want
melt(in.df, id.vars=c("AA"))

So, I use indexing (rather than renaming the columns).
melt(in.df, id.vars=c(1), measure.vars=c(2,3))

The dimensions are as expected, contents are not.
Should melt_check() abort if measure.vars names are not unique?
if(length(measure.vars) != length(unique(measure.vars))) { stop("names of measure variables are not unique: ", vars, call. = FALSE) }
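
A caller-side version of this check can be sketched in base R. The snippet below detects duplicated measure names and, as one possible workaround, disambiguates them with make.unique; the variable names are illustrative.

```r
in.df <- data.frame(AA = c(1, 2, 3), BB = c(10, 11, 12), CC = c(13, 14, 15))
names(in.df) <- c("AA", "BB", "BB")

# Which measure-variable names occur more than once?
measure <- names(in.df)[2:3]
dup <- unique(measure[duplicated(measure)])

# One workaround: make the names unique before melting.
names(in.df) <- make.unique(names(in.df))
names(in.df)
```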

Thanks again for creating an essential tool.

dcast causing reproducible stack overflow due to segmentation fault

RStudio was crashing when I tried to reshape a particular data frame using dcast. I discovered that the crash was actually happening in R itself, so I ran my casting code in R.app and got: Error: segfault from C stack overflow.

I can't provide a true reproducible example, because my data frame is about 558,000 rows and the problem doesn't occur on smaller scales. For example, if I run a loop that runs dcast on 100,000-row increments of the data frame, I don't get the error. However, the problem is reproducible when I run dcast on the full data frame.

Here is a subset of the data frame I'm casting from (with fake values for some variables), followed by the casting function I'm using. The real data set has about 700 levels of prog, 15 levels of prog1, and 5 levels of fa.type. If it would help, I can arrange to get you a copy of the data frame with private information stripped out.

  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707

fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)

In order to get my data reshaped, I loaded the reshape package and used the cast function without a problem. Here's my R session info (note that it shows both reshape and reshape2 loaded, but I had only reshape2 loaded when I first got the error):

R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines stats graphics grDevices utils datasets methods base

other attached packages:
[1] reshape2_1.2.2 reshape_0.8.4 RColorBrewer_1.0-5 Hmisc_3.10-1
[5] ggplot2_0.9.3 plyr_1.8 doBy_4.5-5 MASS_7.3-23
[9] multcomp_1.2-15 survival_2.37-2 mvtnorm_0.9-9994

loaded via a namespace (and not attached):
[1] cluster_1.14.3 colorspace_1.2-1 dichromat_2.0-0 digest_0.6.3 grid_2.15.1
[6] gtable_0.1.2 labeling_0.1 lattice_0.20-13 Matrix_1.0-11 munsell_0.4
[11] proto_0.3-10 scales_0.2.3 stringr_0.6.2 tools_2.15.1

Dimension Mismatch using melt() in reshape2

Modified from StackOverflow:
http://stackoverflow.com/questions/36461653/dimension-mismatch-using-melt-in-reshape2

I am trying to melt a data frame from 'wide' format to 'long' format in R, using the function 'melt' in the package 'reshape2'. However, I am encountering an issue with dimensions when trying to view the output data frame which I am having trouble deciphering. Here is an example:

# load reshape2 package
require(reshape2)

# sample data frame generated using dput
df <- structure(list(year = c(2001, 2002, 2003, 2004), 
                     aet = structure(c(493.1, 407.1, 476.7, 501.6), .Dim = 4L), 
                     drainage = structure(c(5.4, 5.4, 5.4, 5.4), .Dim = 4L), 
                     srunoff = structure(c(25.6, 24.3, 56.0, 50.3), .Dim = 4L)),
                .Names = c("year", "aet", "drainage", "srunoff"), row.names = c(NA, 4L), class = "data.frame")

# if I melt without specifying id.vars, it provides a warning but works fine
df.melt <- melt(df)

# check output
head(df.melt)

# view output
View(df.melt)
# this works fine, and the data frame is visible in RStudio

# now, melt while supplying year as an id variable
df.melt.id <- melt(df, id.vars="year")

# check output
head(df.melt.id)
# the first 6 lines of output print to the console menu, as normal

# view output
View(df.melt.id)

However, when I try to view the df.melt.id data frame, I get the following error:

Error in FUN(X[[i]], ...) : 
  dims [product 4] do not match the length of object [12]

4 corresponds to the original length of the data frame, and 12 is how long it should be. If I check the dimensions using dim(df.melt.id), it returns the appropriate size: [1] 12 3

Using 'melt' from the original reshape package, everything works fine:

df.melt.id <- reshape::melt(df, id.vars="year")
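
The columns in this example carry a one-dimensional .Dim attribute, which is a likely (though here unconfirmed) source of the mismatch. A base R sketch of stripping that attribute before melting, on a cut-down copy of the data:

```r
# Cut-down copy of the data frame above; aet is a 1-d array, not a plain vector.
df <- structure(list(year = c(2001, 2002, 2003, 2004),
                     aet = structure(c(493.1, 407.1, 476.7, 501.6), .Dim = 4L)),
                row.names = c(NA, -4L), class = "data.frame")

had_dim <- !is.null(dim(df$aet))

# as.vector() drops the dim attribute and leaves plain vectors unchanged.
df[] <- lapply(df, as.vector)
dim(df$aet)
```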

Segfault using dcast

Messing around with reshape2 resulted in the following segfault:

> head(dcast(mprecip, .~., sum))

 *** caught segfault ***
address 0x183f5d9e4, cause 'memory not mapped'

Traceback:
 1: .Call("split_indices", index, group, as.integer(n))
 2: split_indices(seq_along(.value), .group, .n)
 3: vaggregate(.value = value, .group = overall, .fun = fun.aggregate,     ..., .default = fill, .n = n)
 4: cast(data, formula, fun.aggregate, ..., subset = subset, fill = fill,     drop = drop, value.var = value.var)
 5: dcast(mprecip, . ~ ., sum)
 6: head(dcast(mprecip, . ~ ., sum))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Here, mprecip is a melted data frame of precipitation data:

> head(mprecip)
  Year variable value
1 1871      Jan  2.76
2 1872      Jan  2.32
3 1873      Jan  2.96
4 1874      Jan  5.22
5 1875      Jan  6.15
6 1876      Jan  6.41

Running the latest CRAN version on R version 2.15.2, OSX 10.8.2

... does not exclude value column not named "value"

I was trying to use cast() today and appreciate the 'value' argument. However, it does not interact how I would expect with "..." in the formula. The specified value column is not excluded. I think if you just renamed the column before parsing the formula, it would work as expected. This is with reshape version 0.8.3. Might be ancient, sorry if this has been already resolved.

acast different column and row sorting in R 3.2.4 and R 3.3.0 under Ubuntu and CentOS

Dear Prof Wickham,

I recently came across the fact that the sorting of columns and rows after using acast changes depending on the R version and platform used:

Ubuntu:

require(stringi)
require(plyr)
require(reshape2)
df <- data.frame(a=c("C10 Rep 1", "C106 Rep 1"),b=c("first-1","first1"),c=c(1,2))
dfc <- acast(data = df, formula = `b`~a, value.var = "c")
dfc
        C106 Rep 1 C10 Rep 1
first1           2        NA
first-1         NA         1
sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.1 plyr_1.8.4     stringi_1.1.1 

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.4   Rcpp_0.12.6   stringr_1.0.0

CentOS:

require(stringi)
require(plyr)
require(reshape2)
df <- data.frame(a=c("C10 Rep 1", "C106 Rep 1"),b=c("first-1","first1"),c=c(1,2))
dfc <- acast(data = df, formula = `b`~a, value.var = "c")
dfc
        C10 Rep 1 C106 Rep 1
first-1         1         NA
first1         NA          2
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.1 plyr_1.8.4     stringi_1.1.1 

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.3.0   Rcpp_0.12.6   stringr_1.0.0

Is this the expected behaviour due to recent changes in R regarding available sorting methods? In any case, it would be nice if the sorting of columns and rows could be consistent between different versions of R.

Thank you very much for your tremendous contribution to the community.

unintended behavior in melt?

Hi Hadley, thanks for writing such a cool package.

I was looking through the source for melt.array today, and I'm confused about its handling of character vectors. I thought I'd file an issue to make sure it's behaving as you intended.

The source code and comments seem to indicate that you didn't want to return factors (e.g. stringsAsFactors = FALSE in line 120), but I'm getting factors anyway with the following example (using version 1.2.2):

a = matrix(1:9, nrow = 3,dimnames = list(letters[1:3], LETTERS[1:3]))

str(melt(a))

I've tracked down where the conversion to factors is occurring, in case you'd like to change it or add it as an option for the user. If you're interested, I'd be happy to write a pull request.

Currently, you run the following function on each dimname:

  var.convert <- function(x) if(is.character(x)) type.convert(x) else x

This calls type.convert, which includes as.is = FALSE by default. This means that character vectors that can't be coerced to other classes will be converted to factors, and there won't be any more strings in dn, which is a temporary label list. As a result, the stringsAsFactors = FALSE argument in the next line doesn't do anything, because there are no strings left to convert.
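
The effect of as.is can be seen in base R alone. This sketch passes as.is explicitly in both calls, since recent R versions warn when it is left unspecified for character input.

```r
labels <- c("a", "b", "c")

# as.is = FALSE: strings that are not numbers or logicals become a factor.
as_factor <- type.convert(labels, as.is = FALSE)

# as.is = TRUE: the same strings stay character.
as_string <- type.convert(labels, as.is = TRUE)

# Numeric-looking strings are converted either way.
as_number <- type.convert(c("1", "2", "3"), as.is = TRUE)

class(as_factor)
class(as_string)
class(as_number)
```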

There seems to be a similar issue with melt.data.frame, but I haven't looked at it as closely.

It seems like it would be nice to have the option to return strings (or possibly to change the default behavior, if that's what you'd prefer). Could I submit a pull request to add this feature?

Thanks again!

Allow specifying data type of value column

This is kind of the opposite to #51. In melt, the value column is coerced to character if only one of the measure.vars is a character. Perhaps it would make sense to introduce a parameter that specifies the desired output type, and to warn if such coercion happens. Something like melt(d, coerce = as.numeric) or melt(d, target_type = "numeric") would substitute values that cannot be coerced with NA (with a warning) and turn off the warning that value types are mixed.
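
The requested behaviour could look roughly like the helper below. It is a hypothetical sketch, not part of reshape2: coerce_values and its coerce argument are invented names, and the helper works on a plain vector rather than inside melt.

```r
# Hypothetical helper: coerce a value column, replacing failures with NA
# and emitting a single summary warning.
coerce_values <- function(x, coerce = as.numeric) {
  out <- suppressWarnings(coerce(x))
  failed <- is.na(out) & !is.na(x)
  if (any(failed)) {
    warning(sum(failed), " value(s) could not be coerced and were set to NA",
            call. = FALSE)
  }
  out
}

res <- suppressWarnings(coerce_values(c("1", "2", "x")))
res
```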

level order for melt.array

Since arrays row/column/... order can be meaningful, it would be nice to have an option in melt.array to be able to have the levels of factors generated from dimnames to be in the original array order.
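
For a matrix, the requested ordering can be approximated in base R by building the factors by hand with explicit levels. This is a sketch of the idea, not reshape2's implementation; the column names Var1/Var2 follow the naming melt uses.

```r
m <- matrix(1:4, nrow = 2, dimnames = list(c("b", "a"), c("y", "x")))

# Levels follow the order of the dimnames, not alphabetical order.
long <- data.frame(
  Var1  = factor(rownames(m)[row(m)], levels = rownames(m)),
  Var2  = factor(colnames(m)[col(m)], levels = colnames(m)),
  value = as.vector(m)
)

levels(long$Var1)
levels(long$Var2)
```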

unexpected behavior in reshape2:dcast() due to 'value_var' name change

Hi Hadley:

I had some code, buried deep (of course), which stopped working when the 'value_var' parameter was renamed in dcast(). All fixed now, but the error message was not very helpful in tracking it down:

> crosstab = dcast(w.df, variable~value, value_var='V1', margins=T, fun.aggregate=sum)
Error in Summary.factor(integer(0), value_var = "V1", na.rm = FALSE) :
  sum not meaningful for factors

It might be worth a line of code to accept and pass along 'value_var' to 'value.var' and/or issue a warning().
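
One way such a shim could work is sketched below. rename_deprecated_arg is a hypothetical helper, not a reshape2 function; it operates on a list of arguments such as the one captured from `...`.

```r
# Hypothetical helper: map a deprecated argument name onto its replacement,
# warning the caller so the old spelling gets noticed.
rename_deprecated_arg <- function(args, old = "value_var", new = "value.var") {
  if (old %in% names(args)) {
    warning("'", old, "' is deprecated; use '", new, "' instead", call. = FALSE)
    names(args)[names(args) == old] <- new
  }
  args
}

args <- suppressWarnings(
  rename_deprecated_arg(list(value_var = "V1", margins = TRUE))
)
names(args)
```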

Just a thought,
Jeffrey

`. ~ .` crashes

e.g. dcast(mtcars, . ~ ., mean)

dcast drops data class

dcast doesn't preserve the class of columns, for example POSIX times:

uu <- data.frame(a=1,b=1,x=as.POSIXct("2023-03-05 12:20"))
dcast(uu,a~b,value.var="x")
a 1
1 1 1678036800

It should.

Setting value variable in cast does not work if "value" column already exists

df <- data.frame(
  id1 = rep(letters[1:2],2), 
  id2 = rep(LETTERS [1:2],each=2), var1=1:4)

df.m <- melt(df)
df.m$value2 <- runif(4)
cast(df.m, id1~id2, value="value")
cast(df.m, id1~id2, value="value2")

Make dcast() work with each()

I bumped into this one recently after trying to switch from the old reshape to the new one on @jeroenooms' recommendation. However, dcast works only if the aggregation function returns a single number. I even ranted on SO about this one. I'd really like to use each in this context, as it's an uber-nice function that is really handy in data aggregation. Is such functionality going to be available in the (near) future?

melt.data.frame fails if there are no value variables

Eg:

> df <- data.frame(x="a", y="b", z="c")
> melt(df)
Using x, y, z as id variables
Error in measure.attributes[[1]] : subscript out of bounds

It should return the data.frame unchanged, as per previous behaviour.

dcast: check names?

When the variable that is being cast into new columns contains white space, the columns of the new data frame will carry the same names, white space included.

aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
aqm$variable2<-factor(aqm$variable, levels=levels(aqm$variable),
labels=gsub("Ozone", "ozone with space", levels(aqm$variable)))
df<-dcast(aqm, Day ~ variable2)
colnames(df)

This is in itself not a big problem, but it makes some ways of indexing columns no longer possible and adds difficulties when working with ggplot. The solution is very simple, involving only:
colnames(df) <- make.names(colnames(df))

it would be nice however if this could be directly made an argument of dcast() & co, making the code more compact and elegant!
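
For reference, base R's make.names is one function that turns arbitrary labels into syntactic names. A small sketch of what it does to a name containing spaces:

```r
cols <- c("Month", "ozone with space", "Solar.R")

# Spaces are replaced by dots; names that are already syntactic are unchanged.
clean <- make.names(cols)
clean
```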

Thanks!!

(this was discussed on manipulatr: http://groups.google.com/group/manipulatr/browse_thread/thread/60579a3cb89253b/b5738362f7c08a97?lnk=gst&q=stigler#b5738362f7c08a97)

... and a custom value column does not work

library(reshape)
df <- data.frame(l = letters[1:3], v = rnorm(3))
cast(df, ... ~ .)

melt on table does not carry names

The melt does not carry names correctly for the table class, from the table function. I think that this is a dispatching problem.
Here is an example.

library(reshape2)
data(Titanic)  # table class object
melt(Titanic)  # does not carry names
reshape2:::melt.array(Titanic) # carries names

I think that the solution would be to add a melt.table <- melt.array and declare the new s3 method.
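
The dispatch behaviour behind this can be reproduced with a toy generic. In the sketch below, describe and its methods are invented for illustration: an object with an explicit "table" class skips the array method until a table method is registered.

```r
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default"
describe.array <- function(x, ...) "array"

# Titanic has class "table", so dispatch never reaches describe.array.
before <- describe(Titanic)

# Aliasing the array method for tables restores the expected behaviour.
describe.table <- describe.array
after <- describe(Titanic)

c(before = before, after = after)
```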

`melt.data.frame()` Forcing the `variable` column to be of type character

Hi,
There is a question on StackOverflow, a couple of years old, that describes a slight nuisance: melt.data.frame() returns the variable column as a factor rather than as a character. Is it possible to include a switch to control this behavior?

No non-missing arguments warning when using min or max in reshape2

Looks like a small bug in plyr::vaggregate when using it with min/max functions.

For more details you can see this SO post: http://stackoverflow.com/questions/24282550/no-non-missing-arguments-warning-when-using-min-or-max-in-reshape2

acast/dcast should error when value_var doesn't exist

Current error message is not informative:

library(reshape2)
dcast(airquality, month ~ day, value_var = "test")

Also check documentation.

Segmentation fault in dcast on a largish data set with plyr 1.8

See https://gist.github.com/krlmlr/10720042 for an example. Crashes on my machine with plyr 1.8, doesn't crash with plyr 1.8.1. (Also doesn't crash with N <- 10000L or smaller.) This affects both CRAN reshape2 and current GitHub master 9f5b7ff. Crash message:

> d.c <- dcast(d.m, ...~f, sum, value.var="w")
Error: segfault from C stack overflow

Probably the easiest fix would be just to depend on plyr 1.8.1.

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.3.0.99     ProjectTemplate_0.5-1

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.1   stringr_0.6.2

FR: variable.factor toggle for melt.array

Suppose you want to extract the upper triangle of a correlation matrix (example on SO). I would try...

melt(m) %>% filter(Var2 > Var1)

but inequalities don't work with factors. Maybe a variable.factor toggle can be added to melt, as in melt.data.table.
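
In the meantime, the upper triangle can be extracted without comparing factors at all. This base R sketch sidesteps melt and uses upper.tri; the Var1/Var2 column names simply mirror melt's naming.

```r
m <- cor(mtcars[, 1:3])

# Row/column positions of the cells strictly above the diagonal.
idx <- which(upper.tri(m), arr.ind = TRUE)

long <- data.frame(
  Var1  = rownames(m)[idx[, "row"]],
  Var2  = colnames(m)[idx[, "col"]],
  value = m[idx],
  stringsAsFactors = FALSE
)
long
```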

colsplit should allow use of Perl regular expressions

Motivating example: consider this cut-down data frame

X <- structure(list(
    origin = structure(1L, .Label = c("c"), class = "factor"),
    cluster = structure(1L, .Label = c("3"), class = "factor"),
    n = structure(1L, .Label = c("1"), class = "factor"),
    distance = 0.0781457901901654, t0 = 0, t1 = 0, t2 = 0, t3 = 0.1125,
    t4 = 0.09, t5 = 0.241666666666667, t6 = 0.35, t7 = 0.43125,
    t8 = 0.494444444444444, t9 = 0.545, t10 = 0.586363636363636,
    t11 = 0.620833333333333, t12 = 0.65, t13 = 0.675,
    t14 = 0.696666666666667, t15 = 0.715625, t16 = 0.732352941176471,
    t17 = 0.747222222222222, t18 = 0.760526315789474, t19 = 0.7725,
    l0 = 132L, l1 = 198L, l2 = 309L, l3 = 1353L, l4 = 74L, l5 = 586L,
    l6 = 586L, l7 = 586L, l8 = 586L, l9 = 586L, l10 = 586L, l11 = 586L,
    l12 = 586L, l13 = 586L, l14 = 1172L, l15 = 586L, l16 = 586L,
    l17 = 586L, l18 = 586L, l19 = 586L), .Names = c("origin",
    "cluster", "n", "distance", "t0", "t1", "t2", "t3", "t4", "t5",
    "t6", "t7", "t8", "t9", "t10", "t11", "t12", "t13", "t14", "t15",
    "t16", "t17", "t18", "t19", "l0", "l1", "l2", "l3", "l4", "l5",
    "l6", "l7", "l8", "l9", "l10", "l11", "l12", "l13", "l14", "l15",
    "l16", "l17", "l18", "l19"), row.names = 5L, class = "data.frame")

The "measure variables" are tXX and lXX. I want to give the t's and the l's each their own column. This would be as simple as

Y <- melt(X, id.vars=c('origin','cluster','n','distance'))
Y <- cbind(Y, colsplit(Y$variable, split="(?<=[lt])", perl=TRUE,
                       names=c("var", "i")))
cast(Y, origin + cluster + distance + n + i ~ var)

if colsplit supported perl=TRUE, but it doesn't, and without that I see no way to do this short of reimplementing colsplit myself. The underlying strsplit does support perl=TRUE so it's as simple as accepting the argument and passing it down. (Note: in the present codebase, strsplit seems to have been replaced with a function str_split_fixed whose definition I cannot find.)
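
A workaround that avoids colsplit is to pull the two parts apart with capture groups in base R. This is a sketch on a few of the variable names above, not a replacement for the requested perl argument.

```r
vars <- c("t0", "t19", "l3", "l14")
pattern <- "^([lt])([0-9]+)$"

# First capture group is the letter, second is the numeric index.
parts <- data.frame(
  var = sub(pattern, "\\1", vars),
  i   = as.integer(sub(pattern, "\\2", vars)),
  stringsAsFactors = FALSE
)
parts
```

The resulting data frame can be bound to the melted data with cbind, in the same way as the colsplit output in the example above.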

cast.Rd documentation error

cast.Rd refers to margin options that are apparently deprecated:

\item{margins}{vector of variable names (can include "grand_col" and
"grand_row") to compute margins for, or TRUE to compute all margins.
Any variables that can not be margined over will be silently dropped.}

Suggested change:

\item{margins}{vector of variable names (if TRUE compute all margins) for which to compute margins. Any variables that cannot be margined over will be silently dropped.}

All melt methods should have `na.rm` argument

Probably should add to generic.

recast outputs a list instead of a data.frame or array

I'm not sure if this is an issue or a feature, but recast in "reshape2" results in a list with the "data" and "label" attributes.

See for example: http://stackoverflow.com/a/23161231/1270695

As noted in the comments at that question, this was not the behavior with "reshape". Compare:

library(reshape)
library(reshape2)
reshape::recast(french_fries, time ~ variable, id.var=1:4)
reshape2::recast(french_fries, time ~ variable, id.var=1:4)

Numeric variables with attributes cause melt.data.frame to fail

library(reshape2)

D <- structure(list(ID = structure(1:10, 
                              .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), 
                              class = "factor"), 
               AGE = c(68L, 63L, 55L, 64L, 60L, 78L, 60L, 62L, 60L, 75L), 
               BMI = c(25L, 27L, 27L, 28L, 32L, NA, 36L, 27L, 31L, 25L), 
               EventDays = c(722L, 738L, 707L, 751L, 735L, 728L, 731L, 717L, 728L, 735L), 
               InterventionDays = c(NA, NA, 575, NA, NA, NA, 490, 643, NA, NA)), 
          .Names = c("ID", "AGE", "BMI", "EventDays", "InterventionDays"), 
          row.names = c(NA, -10L), 
          class = "data.frame")

melt(D, c("ID", "AGE", "BMI")) ## works

attr(D$ID, "label") <- "ID number"  ## add attribute to factor
melt(D, c("ID", "AGE", "BMI")) ## works

attr(D$AGE, "label") <- "Age" ## add attribute to numeric variable
melt(D, c("ID", "AGE", "BMI")) ## does not work

# Error in data.frame(ids, variable, value, stringsAsFactors = FALSE) : 
#   arguments imply differing number of rows: 10, 20

sort_df() and decreasing order?

It appears that sort_df() does not have an option for using variables in decreasing order, which is in sort() in base.

This code should do the trick.

sort_df = function (data, vars = names(data), decreasing = F)
{
  if (length(vars) == 0 || is.null(vars))
    return(data)
  data[do.call("order", list(what = data[, vars, drop = FALSE], decreasing = decreasing)), , drop = FALSE]
}

Tested with:

sort_df(iris, "Sepal.Length")
sort_df(iris, "Sepal.Length", decreasing = T)
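
A variant that passes each sort column to order as its own argument is sketched below, which keeps multi-column sorts well defined. sort_df_desc is a made-up name for this example, not a function in reshape or reshape2.

```r
sort_df_desc <- function(data, vars = names(data), decreasing = FALSE) {
  if (length(vars) == 0) return(data)
  # Each column becomes a separate, unnamed sort key for order().
  keys <- unname(as.list(data[, vars, drop = FALSE]))
  data[do.call(order, c(keys, list(decreasing = decreasing))), , drop = FALSE]
}

desc <- sort_df_desc(iris, "Sepal.Length", decreasing = TRUE)
asc2 <- sort_df_desc(iris, c("Species", "Sepal.Length"))
head(desc$Sepal.Length)
```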

dcast formula fails to resolve ... when column names have spaces

library(reshape2)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
colnames(aqm) <- c("my month" , "my day", "variable", "value")
dcast(aqm, ... ~ variable, value.var = "value")

Throws exception

Error in parse(text = x) : <text>:1:4: unexpected symbol
1: my month
       ^

This becomes a problem when I read data using readr which preserves the column names as-is (I like your not messing with my column names by the way). Now if I want to do any dcast using ... shorthand, it would fail. I can work around by constructing the formula manually/programmatically and quote the symbol properly.
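
The manual workaround mentioned above might look like this in base R: wrap each non-syntactic name in backticks and build the formula from text. The variable names are taken from the example; ids, lhs and f are illustrative.

```r
ids <- c("my month", "my day")

# Backtick-quote each name so the parser accepts the embedded spaces.
lhs <- paste(sprintf("`%s`", ids), collapse = " + ")
f <- as.formula(paste(lhs, "~ variable"))

f
all.vars(f)
```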

cast calls fun.aggregate initially with an empty object

cast(data.frame(value=rep(0:1, each=10)), . ~ ., function(x) print(length(x)))

yields

[1] 0
[1] 20
  value (all)
1 (all)    20

Hence, some functions return an error on the initial empty call:

cast(data.frame(value=rep(0:1, each=10)), . ~ ., function(x) binom.test(sum(x), length(x))$p.value)

Error in binom.test(sum(x), length(x)) : 
  'n' must be a positive integer >= 'x'

Of course, using try(binom.test...) would be a workaround.
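An alternative to try() is to guard against the empty input explicitly. The sketch below uses only base R; safe_binom_p is a hypothetical name, and whether cast accepts an NA from the initial empty call is an assumption, not something confirmed here.

```r
# Hypothetical wrapper: return NA for empty input instead of calling binom.test.
safe_binom_p <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)
  }
  binom.test(sum(x), length(x))$p.value
}

safe_binom_p(integer(0))           # the empty probe call no longer errors
safe_binom_p(rep(0:1, each = 10))  # real data is passed through to binom.test
```

Unlike try(), the explicit length check only silences the empty-input case, so genuine errors on real data still surface.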

reshape2 calls melt.data.frame function incorrectly if the reshape package was loaded

If the reshape package is loaded in the same session as reshape2, then reshape2::melt incorrectly dispatches to reshape::melt.data.frame instead of reshape2::melt.data.frame. An example:

df <- data.frame(a=1, b=2, c=3)
library(reshape2)
reshape2::melt(df, id.vars = "a", variable.name = "NAME")
library(reshape)
reshape2::melt(df, id.vars = "a", variable.name = "NAME")

The output differs between the two calls because the variable.name argument is called variable_name in the reshape package. This issue might break packages that use reshape2 instead of the reshape package.

POSIXct values become numeric in reshape2 dcast

When using dcast where the value.var is a POSIXct type, the date values lose their POSIXct class and become numeric.

Per the thread here, the underlying cause appears to be the as.vector call within as.data.frame.matrix.

Example session:

> x <- c("a","b");
> y <- c("c","d");
> z <- as.POSIXct(c("2012-01-01 01:01:01","2012-02-02 02:02:02"));
> d <- data.frame(x, y, z, stringsAsFactors=FALSE);
> str(d);
'data.frame':   2 obs. of  3 variables:
 $ x: chr  "a" "b"
 $ y: chr  "c" "d"
 $ z: POSIXct, format: "2012-01-01 01:01:01" "2012-02-02 02:02:02"
> library(reshape2);
> e <- dcast(d, formula = x ~ y, value.var = "z");
> str(e);
'data.frame':   2 obs. of  3 variables:
 $ x: chr  "a" "b"
 $ c: num  1.33e+09 NA
 $ d: num  NA 1.33e+09
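The suspected mechanism can be reproduced in base R without dcast. This sketch shows as.vector dropping the POSIXct class and one way to restore it by hand afterwards; the UTC time zone is an assumption made here so the result is deterministic.

```r
z <- as.POSIXct(c("2012-01-01 01:01:01", "2012-02-02 02:02:02"), tz = "UTC")

# as.vector() strips the attributes, including the class,
# leaving plain seconds since the epoch.
num <- as.vector(z)
class(num)

# Restore the class by hand; the time zone has to be supplied again.
restored <- as.POSIXct(num, origin = "1970-01-01", tz = "UTC")
restored
```

The same conversion could be applied to the affected columns of a dcast result, at the cost of having to know which columns held date-times.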

error in dcast when non atomic columns present

Hi,
I have a data frame with two columns which are lists of raw vectors. When I try to run dcast on it I get this error:

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : 
  'x' must be atomic

no matter what formula I enter. I was wondering if there is a way to make this work. The larger context is that I am reading data from HBase. It originally comes out in melted form, with columns: key (a raw vector), family, columns (these two raw but can be converted to character) and value (again a raw vector). The first three columns uniquely identify a value. The key and value columns are serialized objects, possibly complex (say nested lists). Thanks

Antonio

dcast converts characters to factors when margins are given

Dear Hadley,

I noticed some strange behavior: characters are converted to factors by dcast(),
but only if "margins" is given:

library(reshape2)

df <- data.frame(Time=0:11, Group=rep(c("Group.1","Group.2","Group.3"), each=4), 
                 Value=1:12, stringsAsFactors=FALSE)
sapply(1:3, function(j) class(df[,j])) # okay, fine

df. <- dcast(df, Time + Group ~ "Value", fun.aggregate=sum, value.var="Value") 
sapply(1:3, function(j) class(df.[,j])) # okay, still fine

df.. <- dcast(df, Time + Group ~ "Value", fun.aggregate=sum, value.var="Value", margins="Group") 
sapply(1:3, function(j) class(df..[,j])) # problem: the Group column is converted to factor

It might be due to some newly created data.frame... Maybe dcast should accept a
stringsAsFactors argument?

Cheers,

Marius

PS: Another "nice-to-have" would be if the user could provide an alternative string for "(all)" when creating margins.

var names for melt.table / melt.matrix

xvar <- iris$Species
yvar <- iris$Species
mytable <- table(xvar,yvar)
melt(mytable)

The names of the melted data frame are NA, regardless of which arguments are passed as value.name or varnames. In reshape1 this was not the case.

dcast provides incorrect margin totals when subsetting on formula variable

e.g. in R 3.1.2 and reshape2 1.4.1:

> library(reshape2)
> library(plyr)
> dcast(mtcars,
            gear ~ carb,
            fun.aggregate = length,
            value.var = "cyl", 
            drop = FALSE,
            subset = .(gear != 4),
            margins = c("gear","carb"))

   gear 1  2 3  4 6 8 (all)
1     3 3  4 3  5 0 0    15
2     4 0  0 0  0 0 0     0
3     5 0  2 0  1 1 1     5
4 (all) 7 10 3 10 1 1    32

> dcast(mtcars[mtcars$gear != 4,],
            gear ~ carb,
            fun.aggregate = length,
            value.var = "cyl",
            drop = FALSE,
            margins = c("gear","carb"))

   gear 1 2 3 4 6 8 (all)
1     3 3 4 3 5 0 0    15
2     5 0 2 0 1 1 1     5
3 (all) 3 6 3 6 1 1    20

The reason is that the margins are calculated by replicating the data frame up front with the added level (all), and then the subset is taken in cast. If the subset requested is on a formula variable that we want a margin on (particularly in the case of !=) you end up with all of the original rows with the value (all) for all margins.

Either the subset needs to be taken up front, prior to entering cast, or some sort of indexing id variable needs to be created in add_margins so that you can be sure that you're getting the correct subset in terms of the original data set.

If there is no palatable solution, it ought to at least be documented.

Call type.convert after dcast

It would be great if dcast optionally called type.convert on each resulting column. Perhaps triggered by a new parameter as.is that defaults to NA (=no type.convert) and otherwise is passed to the type.convert call?

Sometimes data is given as a key-value list; if the values contain non-numerics, everything is a character after dcast.
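Until such an option exists, the conversion can be applied by hand after casting. The sketch below uses a small hand-built data frame (wide, a hypothetical stand-in for a dcast result in which every column came back as character) and base R's type.convert.

```r
# Stand-in for a dcast result where every column is character.
wide <- data.frame(
  id    = c("1", "2", "3"),
  score = c("1.5", "2.5", "NA"),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Re-infer each column's type; as.is = TRUE keeps non-numeric
# columns as character instead of turning them into factors.
wide[] <- lapply(wide, type.convert, as.is = TRUE)
sapply(wide, class)
```

Assigning into wide[] keeps the object a data frame while replacing every column.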

dcast and environments

When using knitr with caching, I came across a problem associated with dcast.

It is reported in yihui/knitr #207 and may be the same as ggplot2 #377.

eval(parse(text = "
library(reshape2)
df = data.frame(x = letters[1:2], y = letters[1:3], z = rnorm(6))
g = c('b', 'a')
dcast(df, y~ordered(x, levels = g))
"), envir = new.env())

which gives

Using z as value column: use value.var to override.
Error in factor(x, ..., ordered = TRUE) : object 'g' not found

Error (and crash) in dcast with NA values when drop=FALSE

I’ve found a bug in dcast in reshape, where NA values are not correctly handled if drop=FALSE. Example:

library(reshape2)
d = data.frame(x = c(10, 10), y = c("A", NA), z=1:2)
# d$y = as.character(d$y)
d2 = melt(d, measure.vars="z")
dcast(d, x+y~., fun=sum, value.var="z", drop=FALSE)

The actual error message depends on whether y is a factor or a character vector (run the commented line). When it’s a factor vector, the error message says:

Error in split_indices(seq_along(.value), .group, .n) :
   INTEGER() can only be applied to a 'integer', not a 'pairlist'

When it’s a character vector, it says:

Error: nrow(res$labels[[1]]) == nrow(data) is not TRUE

And the most surprising thing is that if y is a factor vector and I repeatedly run the dcast line, R actually crashes. This is 100% reproducible: on Windows it crashes the third time the dcast line is run, and on Linux it crashes the first time. I can also reproduce it with the latest Git version.

--please do not edit the information below--

Package: reshape2
Version: 1.2.1
Maintainer: Hadley Wickham
Built: R 2.15.1; ; 2012-06-23 16:36:47 UTC; windows

R Version:
platform = i386-pc-mingw32
arch = i386
os = mingw32
system = i386, mingw32
status =
major = 2
minor = 15.1
year = 2012
month = 06
day = 22
svn rev = 59607
language = R
version.string = R version 2.15.1 (2012-06-22)
nickname = Roasted Marshmallows

Windows XP (build 2600) Service Pack 3

Locale:
LC_COLLATE=Norwegian-Nynorsk_Norway.1252;LC_CTYPE=Norwegian-Nynorsk_Norway.1252;LC_MONETARY=Norwegian-Nynorsk_Norway.1252;LC_NUMERIC=C;LC_TIME=Norwegian-Nynorsk_Norway.1252

Search Path:
.GlobalEnv, package:reshape2, package:stats, package:graphics,
package:grDevices, package:datasets, package:utils, package:methods,
Autoloads, package:base

Work better with non-numeric data

td <- data.frame(
  W=LETTERS[1:10], 
  X = 1:10, 
  Y = letters[11:20], 
  Z = letters[20:11])

library(reshape2)
td.m2 <- melt(td, measure.vars=c("Y", "Z"))
td.c2 <- dcast(td.m2, ... ~ variable)

If not numeric, convert everything to character, and then apply type.convert?

Issue with as.is in melt.array

Hello: I am assuming this is related to issue #30. I would expect the following to return "character", not "factor".

exampledat <- matrix(as.character(1:6), 2, 3)
sapply(melt(exampledat, as.is = TRUE),  class)

Thanks for all of your great contributions to the R community.

Default fill value doesn't work with multi-valent aggregation

library(reshape)
dat <- data.frame(ID=c(1,1,2),VAL=c(3,3,4))
cast(dat,ID~.,value='VAL',fun=function(x)list(min=min(x)))
cast(dat,ID~.,value='VAL',fun=function(x)list(sd=sd(x)))
cast(dat,ID~.,value='VAL',fun=function(x)list(min=min(x),sd=sd(x)))

melt.table() does not preserve dimension names

melt.table() does not preserve dimnames (or set the varnames accordingly). This is adapted from the examples in melt.table():

a <- array(c(1:11, NA), c(2,3,2))
a <- as.table(a)

# Melt with no extra args
melt(a)
#   Var1 Var2 Var3 value
#1     1    1    1     1
#2     2    1    1     2
#3     1    2    1     3
#4     2    2    1     4
#5     1    3    1     5
#6     2    3    1     6
#7     1    1    2     7
#8     2    1    2     8
#9     1    2    2     9
#10    2    2    2    10
#11    1    3    2    11
#12    2    3    2    NA

# Specifying varnames explicitly works
melt(a, varnames=c("X","Y","Z"))
#   X Y Z value
#1  1 1 1     1
#2  2 1 1     2
#3  1 2 1     3
#4  2 2 1     4
#5  1 3 1     5
#6  2 3 1     6
#7  1 1 2     7
#8  2 1 2     8
#9  1 2 2     9
#10 2 2 2    10
#11 1 3 2    11
#12 2 3 2    NA

When you assign dimnames to a, the dimension names are not carried through to the column names in the output of melt:

dimnames(a) <-  lapply(list(X=2, Y=3, Z=2), function(x) LETTERS[1:x])
melt(a)
#    value NA NA NA
#1      A  A  A  1
#2      B  A  A  2
#3      A  B  A  3
#4      B  B  A  4
#5      A  C  A  5
#6      B  C  A  6
#7      A  A  B  7
#8      B  A  B  8
#9      A  B  B  9
#10     B  B  B 10
#11     A  C  B 11
#12     B  C  B NA
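For comparison, base R's as.data.frame method for tables does carry the dimension names through. This is an alternative to melt rather than a fix for it, and it returns the dimension columns as factors by default.

```r
a <- as.table(array(c(1:11, NA), c(2, 3, 2)))
dimnames(a) <- lapply(list(X = 2, Y = 3, Z = 2), function(x) LETTERS[1:x])

# Base R alternative (not melt): the names of dimnames(a)
# become the column names of the long data frame.
long <- as.data.frame(a, responseName = "value")
names(long)
head(long, 3)
```

The responseName argument plays the role that value.name has in melt.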