Breeding: performance - (Fast) word frequency matrix in R -

Monday, 15 March 2010

performance - (Fast) word frequency matrix in R -

I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that they are a bit slow for my needs. It appears that producing the word-frequency matrix is the bottleneck.

The code for my function is as follows.

    library(qdap)

    liwcr <- function(inputtext, dict) {
      if (!file.exists(dict))
        stop("Dictionary file does not exist.")

      # Read in dictionary categories
      # Start by figuring out where the category list begins and ends
      dictionarytext <- readLines(dict)
      if (!length(grep("%", dictionarytext)) == 2)
        stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

      catstart <- grep("%", dictionarytext)[1]
      catstop <- grep("%", dictionarytext)[2]
      dictlength <- length(dictionarytext)

      dictionarycategories <- read.table(dict, header = FALSE, sep = "\t",
                                         skip = catstart, nrows = (catstop - 2))

      wordcount <- word_count(inputtext)

      outputframe <- dictionarycategories
      outputframe["count"] <- 0

      # Now read in dictionary words
      no_col <- max(count.fields(dict, sep = "\t"), na.rm = TRUE)
      dictionarywords <- read.table(dict, header = FALSE, sep = "\t", skip = catstop,
                                    nrows = (dictlength - catstop), fill = TRUE,
                                    quote = "\"", col.names = 1:no_col)

      workingmatrix <- wfdf(inputtext)
      for (i in workingmatrix[, 1]) {
        if (i %in% dictionarywords[, 1]) {
          occurrences <- 0
          foundword <- dictionarywords[dictionarywords$X1 == i, ]
          foundcategories <- foundword[1, 2:no_col]
          for (w in foundcategories) {
            if (!is.na(w) & (!w == "")) {
              existingcount <- outputframe[outputframe$V1 == w, ]$count
              outputframe[outputframe$V1 == w, ]$count <-
                existingcount + workingmatrix[workingmatrix$Words == i, ]$all
            }
          }
        }
      }
      return(outputframe)
    }

I realize the loop is inefficient, but in an effort to locate the bottleneck, I tested the code without that portion (simply reading in each text file and producing the word-frequency matrix), and saw very little in the way of speed improvements. Example:

    library(qdap)
    fn <- reports::folder(delete_me)
    n <- 10000

    lapply(1:n, function(i) {
        out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
        cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
    })

    filename <- sprintf("tweet%s.txt", 1:n)

    for (i in 1:length(filename)) {
      print(filename[i])
      text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
      freq <- wfm(text)
    }

The input files are Twitter and Facebook status postings.

Is there any way to improve the speed of this code?

EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, to give an idea of what I'm dealing with: 25k text files, each containing all the available tweets from an individual Twitter user. There are also an additional 100k files of Facebook status updates, structured in the same way.

Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings for each. Basically, I read everything in at one time and operate on the entire data set. You could then split it back apart with split if you wanted.
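To illustrate the "read everything in, then operate on the whole data set" idea without any packages, here is a minimal base R sketch. It is not how qdap or tm build their matrices internally; the toy documents, the simple regex tokenizer, and the names docs, tokens, long and freq are all assumptions made for illustration.

```r
# Toy documents standing in for the tweet files (hypothetical data).
docs <- c(tweet1 = "the cat sat on the mat",
          tweet2 = "the dog sat",
          tweet3 = "a cat and a dog")

# Tokenize every document in one vectorized call (crude tokenizer, an assumption).
tokens <- strsplit(tolower(docs), "[^a-z']+")
names(tokens) <- names(docs)

# Build one long table of (word, document) pairs ...
long <- data.frame(
  word = unlist(tokens, use.names = FALSE),
  doc  = rep(names(tokens), lengths(tokens)),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$word), ]  # drop empty tokens, if any

# ... and cross-tabulate it once into a words-by-documents count matrix.
freq <- unclass(table(long$word, long$doc))

# Per-document counts can still be pulled back out column by column.
freq[, "tweet1"]
```

The point of the design is that the expensive counting step runs once over all documents rather than once per file inside a loop.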

An MWE, as you should provide with questions

    library(qdap)
    fn <- reports::folder(delete_me)
    n <- 10000

    lapply(1:n, function(i) {
        out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
        cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
    })

    filename <- sprintf("tweet%s.txt", 1:n)

The qdap approach

    tic <- Sys.time() ## time it

    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")

    difftime(Sys.time(), tic) ## time to read in

    the_wfm <- with(dat, wfm(text, tweet))

    difftime(Sys.time(), tic)  ## time to create wfm

Timing the qdap approach

    > tic <- Sys.time() ## time it
    >
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    >
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 2.97617 secs
    >
    > the_wfm <- with(dat, wfm(text, tweet))
    >
    > difftime(Sys.time(), tic)  ## time to create wfm
    Time difference of 48.9238 secs

The qdap-tm combined approach

    tic <- Sys.time() ## time it

    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")

    difftime(Sys.time(), tic) ## time to read in

    tweet_corpus <- with(dat, as.Corpus(text, tweet))

    tdm <- tm::TermDocumentMatrix(tweet_corpus,
        control = list(removePunctuation = TRUE,
        stopwords = FALSE))

    difftime(Sys.time(), tic)  ## time to create TermDocumentMatrix

Timing the qdap-tm combined approach

    > tic <- Sys.time() ## time it
    >
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    >
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 3.108177 secs
    >
    >
    > tweet_corpus <- with(dat, as.Corpus(text, tweet))
    >
    > tdm <- tm::TermDocumentMatrix(tweet_corpus,
    +     control = list(removePunctuation = TRUE,
    +     stopwords = FALSE))
    >
    > difftime(Sys.time(), tic)  ## time to create TermDocumentMatrix
    Time difference of 13.52377 secs

There is a qdap-tm Package Compatibility vignette to help users move between qdap and tm. As you can see, on 10000 tweets the combined approach is ~3.5x faster (about 13.5 seconds versus 48.9 seconds). A purely tm approach may be faster still. Also, if you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.

Your code, though, would be slower either way because it's not the R way of doing things. I'd recommend reading some additional material on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R, which I'd recommend.
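As a hedged illustration of what "the R way" might look like for the category tally, the sketch below replaces the nested for loops with a vectorized lookup (match) and a grouped sum (tapply). It assumes the dictionary has already been reshaped into long form, one row per word/category pair; the data, the category labels, and the names word_counts, dict_long, hits and category_counts are all hypothetical.

```r
# Hypothetical word counts, shaped like the Words/all columns used in the question.
word_counts <- data.frame(
  Words = c("happy", "sad", "walk", "cry"),
  all   = c(3, 1, 2, 4),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary in long form: one row per (word, category) pair.
dict_long <- data.frame(
  word     = c("happy", "sad", "sad", "cry", "cry"),
  category = c("posemo", "negemo", "affect", "negemo", "affect"),
  stringsAsFactors = FALSE
)

# Look up every dictionary word's count in one vectorized step;
# dictionary words that never occur in the text get a count of zero.
hits <- word_counts$all[match(dict_long$word, word_counts$Words)]
hits[is.na(hits)] <- 0

# Sum the counts within each category, with no explicit loop.
category_counts <- tapply(hits, dict_long$category, sum)
category_counts
```

Because match and tapply each make a single pass over the data, this avoids repeatedly subsetting and reassigning into a data frame inside a loop, which is typically where the original function would spend its time.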

r performance text-analysis word-frequency qdap
