Friday, 15 January 2010

Multithreading - Using CQLSSTableWriter concurrently



I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create an SSTable for the data it holds, in order to parallelize the process as much as possible (and to stream it into the Cassandra ring as well).

After the initial hurdles with CQLSSTableWriter (like requiring the yaml file), I'm confronted with this issue:

java.lang.RuntimeException: Attempting to load already loaded column family customer.rawts
    at org.apache.cassandra.config.Schema.load(Schema.java:347)
    at org.apache.cassandra.config.Schema.load(Schema.java:112)
    at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336)

I'm creating a writer on each parallel partition like this:

def store(rdd: RDD[Message]) = {
  rdd.foreachPartition(msgIterator => {
    val writer = CQLSSTableWriter.builder()
      .inDirectory("/tmp/cass")
      .forTable(schema)
      .using(insertSttmt)
      .build()
    msgIterator.foreach(msg => {...})
  })
}

And if I'm reading the exception correctly, I can only create one writer per table in a single JVM. I guess that writes to the writer are not thread-safe, and even if they were, having parallel tasks all trying to dump a few GB of data to disk at the same time would defeat the purpose of using SSTables for bulk upload anyway.

So, are there ways to use CQLSSTableWriter concurrently?

If not, what is the next best alternative to load batch data at high throughput into Cassandra?

As you have observed, a single writer can only be used serially (ConcurrentModificationExceptions happen if it is not), and creating multiple writers in the same JVM fails due to the static schema construction inside the Cassandra code that the SSTableWriter uses.
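One way to live with the "one writer per JVM, serial use only" constraint is to funnel all rows through a single-threaded executor, so that many producer threads can submit rows but only one thread ever touches the writer. The sketch below uses a hypothetical `RowWriter` trait as a stand-in for CQLSSTableWriter (which cannot be instantiated here without a live Cassandra schema); the real class would be driven the same way via its `addRow` method.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical stand-in for CQLSSTableWriter; addRow mirrors its row-append call.
trait RowWriter {
  def addRow(values: Any*): Unit
  def close(): Unit
}

// All producer threads submit rows to a single-threaded executor, so the
// one writer instance in this JVM is only ever used by one thread at a time.
class SerialWriter(underlying: RowWriter) {
  private val exec = Executors.newSingleThreadExecutor()

  def write(values: Any*): Unit =
    exec.submit(new Runnable {
      def run(): Unit = underlying.addRow(values: _*)
    })

  // Drain queued rows, then close the underlying writer.
  def shutdown(): Unit = {
    exec.shutdown()
    exec.awaitTermination(1, TimeUnit.MINUTES)
    underlying.close()
  }
}
```

This serializes disk I/O, of course, so it only helps when the upstream computation (not the writing) is the parallel part.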

i'm not aware of workaround other spawn multiple jvms, each writing separate directory.

We have filed a Cassandra JIRA ticket to address this issue:

https://issues.apache.org/jira/browse/cassandra-7463

multithreading cassandra bulkinsert apache-spark
