Thursday, 15 April 2010

R aggregate all possible combinations incl. "don't cares"




Say we've got a data frame with 3 columns representing 3 different cases, each of which can be in state 0 or 1. A 4th column contains a measurement.

    # Three binary state columns plus one measurement column
    set.seed(123)
    df <- data.frame(round(runif(25)), round(runif(25)), round(runif(25)), runif(25))
    colnames(df) <- c("v1", "v2", "v3", "x")
    head(df)

      v1 v2 v3         x
    1  0  1  0 0.2201189
    2  1  1  0 0.3798165
    3  0  1  1 0.6127710

    # Mean of x for every observed combination of v1, v2 and v3
    aggregate(df$x, by=list(df$v1, df$v2, df$v3), FUN=mean)

      Group.1 Group.2 Group.3         x
    1       0       0       0 0.1028646
    2       1       0       0 0.5081943
    3       0       1       0 0.4828984
    4       1       1       0 0.5197925
    5       0       0       1 0.4571073
    6       1       0       1 0.3219217
    7       0       1       1 0.6127710
    8       1       1       1 0.6029213

The aggregate function calculates the mean for all possible combinations. However, in my research I need to know the outcome of combinations in which some columns may have any state. For example, the mean of all observations with v1==1 & v2==1, regardless of the contents of v3. The result should look like this, with an asterisk representing "don't care":

      Group.1 Group.2 Group.3         x
    1       *       *       * 0.1234567   (this is the mean of all rows)
    2       0       *       * 0.1234567
    3       1       *       * 0.1234567
    4       *       0       * 0.1224567
    5       *       1       * 0.1234567
    [ other possible combinations follow, should be a total of 27 rows ]

Is there an easy way to accomplish this?
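The figure of 27 rows follows from each of the three columns taking one of three values (0, 1, or "don't care"), which gives 3^3 patterns. Here is a minimal base-R sketch of that count, plus one "don't care" mean computed by hand. The names patterns and dont_care_mean are made up for illustration and are not part of the original post:

```r
# Rebuild the example data from the question
set.seed(123)
df <- data.frame(round(runif(25)), round(runif(25)), round(runif(25)), runif(25))
colnames(df) <- c("v1", "v2", "v3", "x")

# Every column is either "0", "1" or "*" (don't care): 3 * 3 * 3 patterns
patterns <- expand.grid(v1 = c("*", "0", "1"),
                        v2 = c("*", "0", "1"),
                        v3 = c("*", "0", "1"),
                        stringsAsFactors = FALSE)
nrow(patterns)  # 27

# One "don't care" mean by hand: v1 == 1 and v2 == 1, any v3
dont_care_mean <- mean(df$x[df$v1 == 1 & df$v2 == 1])
dont_care_mean
```

Doing this by hand for every pattern gets tedious quickly, which is why a grouping-based approach is preferable.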

Here is the ldply-ddply method:

    library(plyr)
    # One ddply() call per set of grouping variables; .() is the grand mean
    ldply(list(.(v1,v2,v3),.(v1),.(v2),.()), function(y) ddply(df,y,summarise,x=mean(x)))

       v1 v2 v3         x  .id
    1   0  0  0 0.1028646 <NA>
    2   0  0  1 0.4571073 <NA>
    3   0  1  0 0.4828984 <NA>
    4   0  1  1 0.6127710 <NA>
    5   1  0  0 0.5081943 <NA>
    6   1  0  1 0.3219217 <NA>
    7   1  1  0 0.5197925 <NA>
    8   1  1  1 0.6029213 <NA>
    9   0 NA NA 0.4436400 <NA>
    10  1 NA NA 0.4639997 <NA>
    11 NA  0 NA 0.4118793 <NA>
    12 NA  1 NA 0.5362985 <NA>
    13 NA NA NA 0.4566702 <NA>

Essentially, you create a list of all the variable combinations you are interested in, and iterate over it with ldply, using ddply to perform the aggregation. The magic of plyr puts it all into a compact data frame for you. All that remains is to remove the spurious .id column introduced by the grand mean (.()) and to replace the NAs in the groups with "*" if needed.
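The clean-up step could look roughly like the sketch below. To keep it runnable without plyr, it works on a small hand-built stand-in for the ldply() result (a few rows copied from the output above). The names res and group_cols are hypothetical:

```r
# Stand-in for part of the ldply() result shown above
res <- data.frame(v1 = c(0, 1, NA, NA),
                  v2 = c(NA, NA, 0, NA),
                  v3 = c(NA, NA, NA, NA),
                  x  = c(0.4436400, 0.4639997, 0.4118793, 0.4566702),
                  .id = NA)

# Drop the spurious .id column (assigning NULL is harmless if it is absent)
res$.id <- NULL

# Replace NA in the grouping columns with "*"; this turns them into character
group_cols <- c("v1", "v2", "v3")
res[group_cols] <- lapply(res[group_cols], function(col) {
  col <- as.character(col)
  col[is.na(col)] <- "*"
  col
})
res
```

Note that the grouping columns become character vectors once "*" is introduced, so this is best done as a final formatting step, after any numeric filtering on the groups.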

To get all combinations, you can utilize combn and lapply to generate a list of the relevant combinations to plug into ldply:

    # Every subset of the grouping columns, from none (grand mean) to all three
    all.combs <- unlist(lapply(0:3,combn,x=c("v1","v2","v3"),simplify=FALSE),recursive=FALSE)
    ldply(all.combs, function(y) ddply(df,y,summarise,x=mean(x)))

        .id         x v1 v2 v3
    1  <NA> 0.4566702 NA NA NA
    2  <NA> 0.4436400  0 NA NA
    3  <NA> 0.4639997  1 NA NA
    4  <NA> 0.4118793 NA  0 NA
    5  <NA> 0.5362985 NA  1 NA
    6  <NA> 0.4738541 NA NA  0
    7  <NA> 0.4380543 NA NA  1
    8  <NA> 0.3862588  0  0 NA
    9  <NA> 0.5153666  0  1 NA
    10 <NA> 0.4235250  1  0 NA
    11 <NA> 0.5530440  1  1 NA
    12 <NA> 0.3878900  0 NA  0
    13 <NA> 0.4882400  0 NA  1
    14 <NA> 0.5120604  1 NA  0
    15 <NA> 0.4022073  1 NA  1
    16 <NA> 0.4502901 NA  0  0
    17 <NA> 0.3820042 NA  0  1
    18 <NA> 0.5013455 NA  1  0
    19 <NA> 0.6062045 NA  1  1
    20 <NA> 0.1028646  0  0  0
    21 <NA> 0.4571073  0  0  1
    22 <NA> 0.4828984  0  1  0
    23 <NA> 0.6127710  0  1  1
    24 <NA> 0.5081943  1  0  0
    25 <NA> 0.3219217  1  0  1
    26 <NA> 0.5197925  1  1  0
    27 <NA> 0.6029213  1  1  1
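For comparison, the same idea can be sketched in base R without plyr. This is an illustrative alternative, not the method from the answer above: agg_one and result are hypothetical names, and the empty subset (the grand mean) is added explicitly rather than relying on combn with m = 0.

```r
set.seed(123)
df <- data.frame(round(runif(25)), round(runif(25)), round(runif(25)), runif(25))
colnames(df) <- c("v1", "v2", "v3", "x")

vars <- c("v1", "v2", "v3")

# All subsets of the grouping columns; the empty one stands for the grand mean
all.combs <- c(list(character(0)),
               unlist(lapply(1:3, combn, x = vars, simplify = FALSE),
                      recursive = FALSE))

# Aggregate over one subset and mark the unused columns with "*"
agg_one <- function(keep) {
  if (length(keep) == 0) {
    out <- data.frame(x = mean(df$x))
  } else {
    out <- aggregate(df["x"], by = df[keep], FUN = mean)
  }
  for (v in setdiff(vars, keep)) out[[v]] <- "*"
  out[vars] <- lapply(out[vars], as.character)
  out[c(vars, "x")]
}

result <- do.call(rbind, lapply(all.combs, agg_one))
result
```

Because aggregate() only reports combinations that actually occur in the data, the result has at most 27 rows. It reaches 27 only when every combination of v1, v2 and v3 is observed, as it is in the output shown above.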

