R - DocumentTermMatrix fails with a strange error only when # terms > 3000
My code below works fine unless I use it to create a DocumentTermMatrix with more than 3000 terms. These lines:

    movie_dict <- findFreqTerms(movie_dtm_train, 8)
    movie_dtm_hifq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

fail with:

    Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
      'i, j, v' different lengths
    In addition: Warning messages:
    1: In mclapply(unname(content(x)), termFreq, control) :
      all scheduled cores encountered errors in user code
    2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
      NAs introduced by coercion

Is there a way I can handle this? Is a 3000*60000 matrix too big for a DocumentTermMatrix? That seems pretty small for document classification, though.
Full code snippet:

    library(tm)

    n1 <- 60000
    n2 <- 70000

    #******* Loading data ******************************************
    # Kaggle sentiment_analysis dataset
    movie_all <- read.delim('train.tsv', stringsAsFactors = FALSE)
    movie_raw <- movie_all[1:(n2), ]

    #******* Cleaning the corpus ***********************************
    movie_corpus <- Corpus(VectorSource(movie_raw$Phrase))
    movie_corpus_clean <- tm_map(movie_corpus, content_transformer(tolower))
    movie_corpus_clean <- tm_map(movie_corpus_clean, removeNumbers)
    movie_corpus_clean <- tm_map(movie_corpus_clean, removeWords, stopwords())
    movie_corpus_clean <- tm_map(movie_corpus_clean, removePunctuation)
    movie_corpus_clean <- tm_map(movie_corpus_clean, stripWhitespace)
    movie_dtm <- DocumentTermMatrix(movie_corpus_clean)

    #*********** Split the data into train/test sets ***************
    movie_train <- movie_raw[1:(n1), ]
    movie_corpus_train <- movie_corpus_clean[1:(n1)]
    movie_dtm_train <- movie_dtm[1:(n1), ]

    #*********** Remove rare words from the document term matrix ***
    movie_dict <- findFreqTerms(movie_dtm_train, 8)
    movie_dtm_hifq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

EDIT: This fails:

    movie_dtm_hifq_train <- DocumentTermMatrix(movie_corpus_train[1:60000], list(dictionary = movie_dict))

but this works:

    d1 <- DocumentTermMatrix(movie_corpus_train[1:30000], list(dictionary = movie_dict))
    d2 <- DocumentTermMatrix(movie_corpus_train[30000:60000], list(dictionary = movie_dict))
    movie_dtm_hifq_train <- c(d1, d2)

which leads me to believe it must be a size issue.
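The two-chunk workaround above can be generalized. Below is a minimal sketch, not a fix for the underlying error: `build_dtm_in_chunks` and its `chunk_size` argument are hypothetical names introduced here and are not part of tm. It assumes, as the post reports, that `c()` combines DocumentTermMatrix objects built with the same dictionary. Note that `1:30000` and `30000:60000` both include document 30000, so that document is counted twice; the sketch uses non-overlapping index ranges instead.

```r
library(tm)

# Hypothetical helper (not part of tm): build a DocumentTermMatrix in
# consecutive, non-overlapping chunks of documents and combine them with c().
build_dtm_in_chunks <- function(corpus, dictionary, chunk_size = 30000) {
  idx <- seq_len(length(corpus))
  # Group document indices as 1..chunk_size, chunk_size+1..2*chunk_size, ...
  chunks <- split(idx, ceiling(idx / chunk_size))
  pieces <- lapply(chunks, function(i) {
    DocumentTermMatrix(corpus[i], control = list(dictionary = dictionary))
  })
  # Combine the per-chunk matrices, relying on c() as used in the post.
  do.call(c, unname(pieces))
}
```

With the objects from the post, this would be called as `movie_dtm_hifq_train <- build_dtm_in_chunks(movie_corpus_train, movie_dict)`. If the result has one row per training document, the chunks did not overlap.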
r sentiment-analysis tm document-classification