Tuesday, 15 April 2014

Implement k-means clustering, accelerated using the triangle inequality, in Python (scikit-learn)




I am attempting to run k-means clustering on a big dataset (9106 items, 100 dimensions). This makes it slow, so I have been recommended to use the triangle inequality as described by Charles Elkan (http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf).

Is there a pre-written function in any toolbox for this?

I've been using scikit-learn, and my code is as follows:

    import numpy as np
    from time import time
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Create a numpy array to hold the data
    data_array = np.empty([9106, 100])

    # Iterate through the data file and add it to the numpy array,
    # skipping the header row and the first column.
    # `reader` is defined elsewhere (presumably a csv.reader over the data file).
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("rownum %d" % rownum)
            colnum = 0
            for col in row:
                if colnum != 0:
                    data_array[rownum - 1, colnum - 1] = float(col)
                colnum += 1
        rownum += 1

    n_samples, n_features = data_array.shape
    n_digits = len(data_array)
    labels = None  # digits.target

    # Most of the code below is taken from the example on the scikit-learn site
    sample_size = 200

    print("n_digits: %d, \t n_samples %d, \t n_features %d"
          % (n_digits, n_samples, n_features))

    print(79 * '_')
    print('% 9s' % 'init'
          '    time  inertia    homo   compl  v-meas     ARI     AMI  silhouette')

    def bench_k_means(estimator, name, data):
        t0 = time()
        estimator.fit(data)
        print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
              % (name, (time() - t0), estimator.inertia_,
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels, estimator.labels_),
                 metrics.silhouette_score(data, estimator.labels_,
                                          metric='euclidean',
                                          sample_size=sample_size)))

    bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
                  name="k-means++", data=data_array)

    bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
                  name="random", data=data_array)

    # In this case the seeding of the centers is deterministic, hence we run
    # the kmeans algorithm only once with n_init=1
    pca = PCA(n_components=n_digits).fit(data_array)
    bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
                  name="pca-based", data=data_array)
    print(79 * '_')

> Is there a pre-written function in any toolbox for this?

No. There was an attempt at this algorithm, but it wasn't merged into master.
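For context, the pruning rule at the heart of Elkan's paper can be sketched in a few lines. If c is the best center found so far for a point x and d(c, c') >= 2 * d(x, c), then d(x, c') >= d(x, c), so the distance to c' never needs to be computed. The function name `assign_with_pruning` is made up for this illustration, and the sketch covers only this one rule in the assignment step, not the full algorithm with its per-point upper and lower bounds.

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign each row of X to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.
    (Hypothetical helper, for illustration only.)"""
    # Pairwise distances between centers, computed once per assignment pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_dist = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best), so c_j cannot be strictly closer.
            if cc[best, j] >= 2.0 * best_dist:
                skipped += 1
                continue
            dist = np.linalg.norm(x - centers[j])
            if dist < best_dist:
                best, best_dist = j, dist
        labels[i] = best
    return labels, skipped
```

The Python loop here is far too slow for real use; it is only meant to show which distance computations the inequality lets you skip.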

> This makes it slow

Then try MiniBatchKMeans before you start hacking away at complicated algorithms. It's orders of magnitude faster than vanilla KMeans and about as good.
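A minimal sketch of that suggestion, assuming random stand-in data with the same shape as in the question. The choice of 10 clusters and a batch size of 1024 is arbitrary and does not come from the question; both would need tuning for the real data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the real data: 9106 items, 100 dimensions (assumed random here).
rng = np.random.RandomState(0)
data_array = rng.rand(9106, 100)

# n_clusters and batch_size are illustrative assumptions, not recommendations.
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3, random_state=0)
mbk.fit(data_array)

print(mbk.cluster_centers_.shape)  # one row per cluster center
print(mbk.inertia_)                # sum of squared distances to nearest center
```

The fitted estimator exposes `labels_` and `inertia_` just like `KMeans`, so it can be passed to a benchmarking function such as `bench_k_means` in the question without changes.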

python scikit-learn k-means
