Breeding: Implement k-means clustering, accelerated using the triangle inequality, in Python (scikit-learn)

Tuesday, 15 April 2014

I am attempting to run k-means clustering on a big dataset (9106 items, 100 dimensions). This makes it slow, so I have been recommended to use the triangle inequality, as described by Charles Elkan (http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf).
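For readers unfamiliar with the idea, the bound that Elkan's paper builds on can be shown in a few lines: if the distance between two centers c and c' is at least twice the distance from a point x to c, then x cannot be closer to c' than to c, so d(x, c') never needs to be computed. The sketch below applies only this one test inside a single assignment pass. It is not Elkan's full algorithm, which additionally carries upper and lower bounds from one iteration to the next, and the function names and toy data are made up for illustration.

```python
import math
import random


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary."""
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)]
          for i in range(k)]
    labels, skipped = [], 0
    for x in points:
        best, best_d = 0, math.dist(x, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best): center j cannot win.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centers):
    """Reference assignment that computes every point-to-center distance."""
    return [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in points]


# Toy data (illustrative only): three well-separated blobs in 2-D.
random.seed(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
          for cx, cy in centers for _ in range(100)]

labels, skipped = assign_with_pruning(points, centers)
print("skipped %d point-to-center distance computations" % skipped)
```

The pruned pass returns the same labels as the brute-force one; the saving comes only from the skipped distance computations, which become more frequent as the centers move apart relative to the spread of the points around them.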

Is there a pre-written function in any toolbox for this?

I've been using scikit-learn, and my code is as follows:

    from time import time

    import numpy as np
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # NOTE: `reader` is not defined in the post; it is presumably a
    # csv.reader opened over the data file.

    # implement a numpy array to hold the data
    data_array = np.empty([9106, 100])

    # iterate through the data file and add it to the numpy array,
    # skipping the header row and the first column
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("rownum %d" % rownum)
            colnum = 0
            for col in row:
                if colnum != 0:
                    data_array[rownum - 1, colnum - 1] = float(col)
                # incremented on every column; inside the `if` it would never advance
                colnum += 1
        rownum += 1

    n_samples, n_features = data_array.shape
    n_digits = len(data_array)
    labels = None  # digits.target

    # most of the code below is taken from the example on the scikit-learn site
    sample_size = 200

    print("n_digits: %d, \t n_samples %d, \t n_features %d"
          % (n_digits, n_samples, n_features))

    print(79 * '_')
    print('% 9s' % 'init'
          '    time  inertia    homo   compl  v-meas     ari     ami  silhouette')


    def bench_k_means(estimator, name, data):
        t0 = time()
        estimator.fit(data)
        print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f' % (
            name, (time() - t0), estimator.inertia_,
            metrics.homogeneity_score(labels, estimator.labels_),
            metrics.completeness_score(labels, estimator.labels_),
            metrics.v_measure_score(labels, estimator.labels_),
            metrics.adjusted_rand_score(labels, estimator.labels_),
            metrics.adjusted_mutual_info_score(labels, estimator.labels_),
            metrics.silhouette_score(data, estimator.labels_,
                                     metric='euclidean',
                                     sample_size=sample_size),
        ))


    bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
                  name="k-means++", data=data_array)

    bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
                  name="random", data=data_array)

    # in this case the seeding of the centers is deterministic, hence we run
    # the kmeans algorithm only once, with n_init=1
    pca = PCA(n_components=n_digits).fit(data_array)
    bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
                  name="pca-based",
                  data=data_array)
    print(79 * '_')

> Is there a pre-written function in any toolbox for this?

No. There's an attempt at this algorithm, but it wasn't merged into master.

> This makes it slow

Then try MiniBatchKMeans before you start hacking away at complicated algorithms. It's orders of magnitude faster than vanilla KMeans and about as good.
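A rough way to check that suggestion on data of the same shape is to time both estimators side by side. This is only a sketch under stated assumptions: the random array stands in for the real dataset, and `n_clusters=10`, `n_init=3` and `batch_size=1024` are placeholder values, not settings taken from the question or the answer.

```python
from time import time

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic stand-in with the same shape as the dataset in the question.
rng = np.random.RandomState(0)
data = rng.rand(9106, 100)

n_clusters = 10  # hypothetical; not taken from the question

estimators = {
    "KMeans": KMeans(n_clusters=n_clusters, n_init=3, random_state=0),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=n_clusters, n_init=3,
                                       batch_size=1024, random_state=0),
}

# Fit each estimator once and record wall-clock time and final inertia.
results = {}
for name, estimator in estimators.items():
    t0 = time()
    estimator.fit(data)
    results[name] = (time() - t0, estimator.inertia_)
    print("%-16s %.2fs  inertia %.1f" % (name, results[name][0], results[name][1]))
```

Comparing the two inertia values gives a sense of how much clustering quality is traded for speed; the actual figures depend on the data, the number of clusters and the scikit-learn version.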

python scikit-learn k-means
