Monday, 15 March 2010

performance - (Fast) word frequency matrix in R




I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that they are a bit slow for my needs. It appears that producing the word-frequency matrix is the bottleneck.

The code for my function is as follows.

    library(qdap)

    liwcr <- function(inputtext, dict) {
      if (!file.exists(dict))
        stop("Dictionary file does not exist.")

      # Read in dictionary categories.
      # Start by figuring out where the category list begins and ends.
      dictionarytext <- readLines(dict)
      if (!length(grep("%", dictionarytext)) == 2)
        stop("Dictionary is not properly formatted. Make sure the category list is correctly partitioned (using '%').")

      catstart <- grep("%", dictionarytext)[1]
      catstop <- grep("%", dictionarytext)[2]
      dictlength <- length(dictionarytext)

      dictionarycategories <- read.table(dict, header = FALSE, sep = "\t",
                                         skip = catstart, nrows = (catstop - 2))

      wordcount <- word_count(inputtext)

      outputframe <- dictionarycategories
      outputframe["count"] <- 0

      # Now read in the dictionary words.
      no_col <- max(count.fields(dict, sep = "\t"), na.rm = TRUE)
      dictionarywords <- read.table(dict, header = FALSE, sep = "\t",
                                    skip = catstop, nrows = (dictlength - catstop),
                                    fill = TRUE, quote = "\"", col.names = 1:no_col)

      workingmatrix <- wfdf(inputtext)
      for (i in workingmatrix[, 1]) {
        if (i %in% dictionarywords[, 1]) {
          occurrences <- 0
          foundword <- dictionarywords[dictionarywords$X1 == i, ]
          foundcategories <- foundword[1, 2:no_col]
          for (w in foundcategories) {
            if (!is.na(w) & (!w == "")) {
              existingcount <- outputframe[outputframe$V1 == w, ]$count
              outputframe[outputframe$V1 == w, ]$count <-
                existingcount + workingmatrix[workingmatrix$Words == i, ]$all
            }
          }
        }
      }
      return(outputframe)
    }

I realize the for loop is inefficient, but in an effort to locate the bottleneck I tested the function without that portion of the code (simply reading in each text file and producing the word-frequency matrix) and saw very little in the way of speed improvement. Example:

    library(qdap)

    fn <- reports::folder(delete_me)
    n <- 10000

    # Write n sample "tweets" of 30 random words each
    lapply(1:n, function(i) {
      out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
      cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
    })

    filename <- sprintf("tweet%s.txt", 1:n)

    for (i in 1:length(filename)) {
      print(filename[i])
      # Path as given in the question; use file.path(fn, filename[i]) for the files generated above
      text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
      freq <- wfm(text)
    }
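One way to check where the time goes is to time each stage on its own. The following is only a minimal base-R sketch: the in-memory sample texts, the vocabulary, and the two stages (splitting and counting) are assumptions standing in for the real files and for wfm.

```r
# Hypothetical stand-in data: 200 short texts held in memory.
set.seed(1)
vocab <- c("alpha", "beta", "gamma", "delta", "epsilon")
texts <- replicate(200, paste(sample(vocab, 30, TRUE), collapse = " "))

# Time each stage separately so the slow one stands out.
t_split <- system.time(tokens <- strsplit(texts, " ", fixed = TRUE))
t_count <- system.time(counts <- lapply(tokens, table))

# "elapsed" is the wall-clock time in seconds for each stage.
timings <- c(split = t_split[["elapsed"]], count = t_count[["elapsed"]])
```

Swapping the stand-in stages for the real file reading and the wfm call would show which of the two dominates.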

The input files are Twitter and Facebook status postings.

Is there any way to improve the speed of this code?

EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, to give an idea of what I'm dealing with: 25k text files, each containing all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.

Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings for each. Basically, I read everything in at one time and operate on the entire data set. You could then split it back apart with split if you wanted.
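As a rough base-R illustration of the "read everything, then tabulate once" idea (this is not qdap's implementation; the temporary directory, the three sample texts, and the naive tokenizer are assumptions made for the sketch):

```r
# Hypothetical demo data: three tiny "tweet" files in a temporary directory.
demo_dir <- file.path(tempdir(), "wfm_demo")
dir.create(demo_dir, showWarnings = FALSE)
texts <- c("the cat sat", "the dog sat down", "a cat and a dog")
files <- file.path(demo_dir, sprintf("tweet%s.txt", seq_along(texts)))
for (k in seq_along(files)) writeLines(texts[k], files[k])

# Read every file once; collapse multi-line files into a single string.
docs <- vapply(files, function(f) paste(readLines(f), collapse = " "),
               character(1))
doc_names <- tools::file_path_sans_ext(basename(files))

# Naive tokenizer (an assumption): lower-case, then split on non-letters.
tokens <- strsplit(tolower(unname(docs)), "[^a-z']+")

# Long format: one row per token, tagged with its document.
long <- data.frame(
  word = unlist(tokens),
  doc  = rep(doc_names, lengths(tokens)),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$word), ]

# A single table() call yields a word-by-document count matrix.
word_freq <- table(long$word, long$doc)

# Split apart afterwards if per-document results are needed.
per_doc <- split(long$word, long$doc)
```

The point of the design is that the counting happens in one vectorized call over all documents rather than once per file.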

A MWE, which you should provide with questions:

    library(qdap)

    fn <- reports::folder(delete_me)
    n <- 10000

    # Write n sample "tweets" of 30 random words each
    lapply(1:n, function(i) {
      out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
      cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
    })

    filename <- sprintf("tweet%s.txt", 1:n)

The qdap approach

    tic <- Sys.time() ## time it

    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")

    difftime(Sys.time(), tic) ## time to read in

    the_wfm <- with(dat, wfm(text, tweet))

    difftime(Sys.time(), tic) ## time to make wfm

Timing the qdap approach

    > tic <- Sys.time() ## time it
    >
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    >
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 2.97617 secs
    >
    > the_wfm <- with(dat, wfm(text, tweet))
    >
    > difftime(Sys.time(), tic) ## time to make wfm
    Time difference of 48.9238 secs

The qdap-tm combined approach

    tic <- Sys.time() ## time it

    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")

    difftime(Sys.time(), tic) ## time to read in

    tweet_corpus <- with(dat, as.Corpus(text, tweet))

    tdm <- tm::TermDocumentMatrix(tweet_corpus,
        control = list(removePunctuation = TRUE,
        stopwords = FALSE))

    difftime(Sys.time(), tic) ## time to make TermDocumentMatrix

Timing the qdap-tm combined approach

    > tic <- Sys.time() ## time it
    >
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    >
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 3.108177 secs
    >
    > tweet_corpus <- with(dat, as.Corpus(text, tweet))
    >
    > tdm <- tm::TermDocumentMatrix(tweet_corpus,
    +     control = list(removePunctuation = TRUE,
    +     stopwords = FALSE))
    >
    > difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
    Time difference of 13.52377 secs

There's a qdap-tm Package Compatibility guide to help users move between qdap and tm. As you can see, on 10000 tweets the combined approach is ~3.5x faster. A purely tm approach may be faster still. If you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.

Your code, though, would be slower either way because it's not the R way of doing things. I'd recommend reading some additional material on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R, which I'd recommend.
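To make "the R way" a little more concrete, here is a minimal sketch of replacing a nested lookup loop with vectorized operations. The toy frequencies, the long-format dictionary, and the column names (word, freq, category) are all hypothetical; a wide dictionary file like the one in the question would first need reshaping into one row per word/category pair.

```r
# Hypothetical word frequencies: one row per distinct word.
freqs <- data.frame(
  word = c("happy", "sad", "run", "table"),
  freq = c(3L, 1L, 2L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary in long form: one row per (word, category) pair.
dictionary <- data.frame(
  word     = c("happy", "happy", "sad", "run"),
  category = c("posemo", "affect", "affect", "motion"),
  stringsAsFactors = FALSE
)

# Look up every dictionary word's frequency in one vectorized step.
idx <- match(dictionary$word, freqs$word)
hits <- !is.na(idx)

# Sum the matched frequencies within each category.
category_counts <- tapply(freqs$freq[idx[hits]],
                          dictionary$category[hits], sum)
```

Here match() does all the word lookups at once and tapply() does all the per-category sums at once, so no data frame is subset or modified inside a loop.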

r performance text-analysis word-frequency qdap
