r - How to use dplyr to eliminate for loops?
Does anyone know of a dplyr method for doing pairwise matching on data with missing observations, followed by some subsequent arithmetic? Below is a for-loop-heavy MWE in base R; I couldn't get my arms around a dplyr approach (despite the first-class vignettes and documentation).
In brief, the code calculates dev, the average of the non-missing quantity observations q sold at adjacent (adj) stores in a given week.
Edit: I'm interested in states with divergent policies. Let the vertical line below represent the state boundary: counties 1, 2, and 3 are in state A (with policy A), and counties 4, 5, and 6 are in state B (with policy B). Counties may have multiple stores.
    ----|----
     1  |  4
        |----
    ----|  5
     2  |
    ----|----
     3  |  6
    ----|----

contig.id identifies a county that is contiguous with 1 or more counties in the opposite state. For example, county 1 (contig.id == 1) is adjacent to counties 4 and 5 in the opposite state (adj1 == 4 and adj2 == 5), but we disregard county 2's geographic adjacency since 1 and 2 are in the same state.
By the same method, county 4 (contig.id == 4) is adjacent only to county 1 (adj1 == 1 and adj2 == NA). End edit.
    df <- data.frame(store = c(1001,1001,145,331,228,228,500,500,61,1135),
                     end.week = c(20061125,20061118,20061125,20061125,20061125,
                                  20061118,20061125,20061118,20061118,20061125),
                     contig.id = c(1,1,2,3,4,4,4,4,5,NA),
                     adj1 = c(4,4,5,6,1,1,1,1,1,NA),
                     adj2 = c(5,5,NA,NA,NA,NA,NA,NA,2,NA),
                     q = c(12.25,14.5,18.75,16,16.5,22,55.25,8.25,24,37.75))

    dev <- NULL
    dev1 <- NULL

    for (i in 1:length(df$contig.id)) {
      temp1 <- integer(0)
      temp2 <- integer(0)
      if (is.na(df$contig.id[i]) == FALSE) {
        temp1 <- which((df$contig.id == df$adj1[i]) &
                       (df$end.week == df$end.week[i]))
        if (length(temp1) > 0) {
          dev[i] <- sum(df$q[temp1])
        }
        if (is.na(df$adj2[i]) == FALSE) {
          temp2 <- which((df$contig.id == df$adj2[i]) &
                         (df$end.week == df$end.week[i]))
          if (length(temp2) > 0) {
            dev[i] <- dev[i] + sum(df$q[temp2])
          }
        }
      } else {
        dev[i] <- NA
      }
      dev[i] <- dev[i] / (length(temp1) + length(temp2))
      dev1[i] <- df$q[i] / dev[i]
    }

    df <- cbind(df, dev, dev1)
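To make the arithmetic concrete, here is the dev calculation for one row worked by hand (an illustrative check against the output shown further down, not part of the original code):

    # Store 1001 in week 20061125 sits in county 1, which is adjacent to
    # counties 4 and 5 (adj1 == 4, adj2 == 5).
    # Stores in county 4 that week: 228 (q = 16.5) and 500 (q = 55.25);
    # county 5 has no observation that week.
    (16.5 + 55.25) / 2   # 35.875 -> dev for store 1001, week 20061125
    12.25 / 35.875       # ~0.3415 -> dev1 = q/dev for that row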
So you have 3 kinds of information here, which is why you needed such complex for-looping. I've tried to normalize your data into 3 tables:
    library(dplyr)
    library(tidyr)

    stores_time <- df %>%
      select(-contig.id, -adj1, -adj2)

    stores_space <- df %>%
      select(store, contig.id) %>%
      mutate(county = contig.id %>% paste0("c", .)) %>%
      select(-contig.id) %>%
      unique

    counties <- df %>%
      select(contig.id, adj1, adj2) %>%
      mutate(county = contig.id %>% paste0("c", .)) %>%
      select(-contig.id) %>%
      unique %>%
      gather(varname, adj_next_state, starts_with("adj")) %>%
      select(-varname) %>%
      mutate(adj_next_state = adj_next_state %>% paste0("c", .))

Now we have data on each store's sales over time (stores_time), on each store's "location" in space (i.e. which county it is in, stores_space), and on the adjacency of counties (counties). I've also converted the adjacency data from wide to long, which may come in handy if you have counties adjacent to more than 2 other counties.
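As a side note, gather() has since been superseded by pivot_longer() in tidyr 1.0+. A minimal sketch of the same counties table built with the newer verb (same content, row order may differ):

    counties <- df %>%
      select(contig.id, adj1, adj2) %>%
      distinct() %>%
      mutate(county = paste0("c", contig.id)) %>%
      select(-contig.id) %>%
      # pivot the adj1/adj2 columns into one long adj_next_state column
      pivot_longer(starts_with("adj"),
                   names_to = "varname", values_to = "adj_next_state") %>%
      select(-varname) %>%
      mutate(adj_next_state = paste0("c", adj_next_state))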
We can join all of these together to obtain a dataset of each store's performance in both "time" and "space":
    stores_tsc <- stores_time %>%
      left_join(stores_space) %>%
      left_join(counties)

To calculate dev, we need to join this table onto itself, because for each store x time combination we want the average over adjacent stores. When we join the table to itself, we need to join county to adj_next_state. We can use some select magic to make that easy:
    stores_tsc %>%
      # rename one column
      select(store, end.week, county = adj_next_state) %>%
      # left join the table onto itself;
      # removing unneeded columns and using unique prevents duplicate rows
      left_join(stores_tsc %>% select(-adj_next_state, -store) %>% unique,
                by = c("county", "end.week")) %>%
      # filter out the store in an unknown county
      filter(county != "cNA") %>%
      # calculate dev
      group_by(store, end.week) %>%
      summarize(dev = mean(q, na.rm = TRUE)) %>%
      ungroup %>%
      mutate(dev = ifelse(is.nan(dev), yes = NA, no = dev))

      store end.week      dev
    1    61 20061118 14.50000
    2   145 20061125       NA
    3   228 20061118 14.50000
    4   228 20061125 12.25000
    5   331 20061125       NA
    6   500 20061118 14.50000
    7   500 20061125 12.25000
    8  1001 20061118 18.08333
    9  1001 20061125 35.87500

You can then merge this with stores_time to calculate dev1 = q/dev.
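A minimal sketch of that last step, assuming the summarized result above has been saved as devs (a name introduced here purely for illustration):

    # devs: store, end.week, dev (the summarized table above, saved to a name)
    devs %>%
      left_join(stores_time %>% select(store, end.week, q),
                by = c("store", "end.week")) %>%
      mutate(dev1 = q / dev)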
Tags: r, for-loop, spatial, dplyr