Sunday, 15 May 2011

Cassandra: Is this a proper schema for the data model?




In a sensor-based application, 300k objects are being monitored per hour on 30 metrics, each having success and failure counters.

My schema:

create table measurements (objid int, hr timestamp, metric text, succ int, fail int, primary key (objid, hr, metric));

The data retention period is within 1 year, so this way the table would have 300k rows, each having 24*360*30*2 columns (cells).
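To make the sizing concrete, here is a quick back-of-the-envelope calculation using only the figures from the question. The constant names are just illustrative, and the 360-day year is the question's own approximation.

```python
# Rough sizing from the numbers given in the question.
OBJECTS = 300_000          # monitored objects, one wide row per objid
HOURS_PER_DAY = 24
DAYS_RETAINED = 360        # the question approximates one year as 360 days
METRICS = 30
COUNTERS_PER_METRIC = 2    # succ and fail

# Cells stored under a single objid over the retention period.
cells_per_row = HOURS_PER_DAY * DAYS_RETAINED * METRICS * COUNTERS_PER_METRIC

# Cells across the whole table.
total_cells = OBJECTS * cells_per_row

print(cells_per_row)   # 518400
print(total_cells)     # 155520000000
```

So each row holds roughly half a million cells, and the table as a whole is on the order of 1.5 * 10^11 cells.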

The usual queries are counter values aggregated over a specified time interval (could be days, weeks, months) and for specified objects (ranging from 1 to hundreds).

Time slicing is OK with column slicing, while retrieval of multiple objects is a bit of a pain, since rows are keyed per object by objid, and this would lead to a multiget.

The general query I can think of is:

select * from measurements where objid in (id1, id2, id3...idn) and hr >= <starttime> and hr < <endtime>;

Of course, the aggregation would have to be done manually in the application.
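As a minimal sketch of what that application-side aggregation might look like, assume the query results arrive as plain tuples of (objid, hr, metric, succ, fail). The function name `aggregate` and the sample rows are hypothetical, and no Cassandra driver is involved here.

```python
from collections import defaultdict
from datetime import datetime


def aggregate(rows, start, end):
    """Sum succ/fail per (objid, metric) for rows with start <= hr < end.

    `rows` is assumed to be an iterable of (objid, hr, metric, succ, fail)
    tuples, mirroring the columns of the measurements table.
    """
    totals = defaultdict(lambda: [0, 0])
    for objid, hr, metric, succ, fail in rows:
        if start <= hr < end:
            bucket = totals[(objid, metric)]
            bucket[0] += succ
            bucket[1] += fail
    return {key: tuple(value) for key, value in totals.items()}


# Made-up sample data for illustration only.
sample_rows = [
    (1, datetime(2011, 5, 1, 0), "cpu", 10, 1),
    (1, datetime(2011, 5, 1, 1), "cpu", 20, 2),
    (2, datetime(2011, 5, 1, 0), "cpu", 5, 0),
    (1, datetime(2011, 5, 2, 0), "cpu", 7, 7),  # outside the window below
]

result = aggregate(sample_rows, datetime(2011, 5, 1), datetime(2011, 5, 2))
print(result)
```

The time filter is repeated in the application only so the sketch is self-contained; in practice the column slice on hr would already have restricted the rows.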

Q: What is the optimal way to structure the data given this query pattern?

The worst case is the 'overall' result over a period, which means taking all objects into account. I mean, from my perspective, that is a total table scan. What is the recommended practice to perform such a task w/o resorting to MapReduce?

If you know you are typically restricting to a subset of time, and the possible set of objects within each hour may be sparse, you might consider reversing the index order, so that time is the first dimension. That way, you are picking out columns from a restricted set of rows. You still need to do a multi-get, but if you are querying for objects in common, the number of rows may be smaller.
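The effect of reversing the index order can be illustrated with an in-memory model, where the outer dictionary key plays the role of the row key and the inner keys play the role of column names. This is only a sketch of the idea, not Cassandra code; `build_time_first`, `query`, and the integer hour values are assumptions made for the example.

```python
from collections import defaultdict


def build_time_first(measurements):
    """Index measurements with the hour as the outer (row) key.

    Inner keys are (objid, metric) pairs, standing in for column names.
    """
    index = defaultdict(dict)
    for objid, hr, metric, succ, fail in measurements:
        index[hr][(objid, metric)] = (succ, fail)
    return index


def query(index, hours, objids):
    """Read one 'row' per requested hour and keep only the wanted objects."""
    wanted = set(objids)
    found = []
    for hr in hours:
        for (objid, metric), (succ, fail) in index.get(hr, {}).items():
            if objid in wanted:
                found.append((objid, hr, metric, succ, fail))
    return sorted(found)


# Made-up data: hours are plain integers to keep the example short.
measurements = [
    (1, 100, "cpu", 10, 1),
    (2, 100, "cpu", 5, 0),
    (3, 100, "cpu", 8, 2),
    (1, 101, "cpu", 20, 2),
    (1, 102, "cpu", 30, 3),
]

time_first = build_time_first(measurements)
hits = query(time_first, hours=[100, 101], objids=[1, 2])
print(hits)
```

With this layout a query over two hours reads two rows regardless of how many objects are requested, whereas the object-first layout reads one row per objid.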

If you typically query/aggregate at different granularities of time, you can store duplicate data at higher granularities of time as well, per day, week, month, etc. This will speed up queries at larger time scales. De-normalization is your friend in Cassandra!
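A rough sketch of how such a coarser-grained copy could be derived from the hourly data is shown below. It assumes the same (objid, hr, metric, succ, fail) tuples as the schema in the question; `rollup_daily` is a hypothetical helper, and writing the result back to a separate per-day table is left out.

```python
from collections import defaultdict
from datetime import date, datetime


def rollup_daily(hourly_rows):
    """Collapse hourly counters into one (succ, fail) pair per object, day, metric."""
    daily = defaultdict(lambda: [0, 0])
    for objid, hr, metric, succ, fail in hourly_rows:
        bucket = daily[(objid, hr.date(), metric)]
        bucket[0] += succ
        bucket[1] += fail
    return {key: tuple(value) for key, value in daily.items()}


# Made-up hourly rows for illustration only.
hourly_rows = [
    (1, datetime(2011, 5, 1, 0), "cpu", 10, 1),
    (1, datetime(2011, 5, 1, 13), "cpu", 20, 2),
    (1, datetime(2011, 5, 2, 0), "cpu", 7, 0),
]

daily_rows = rollup_daily(hourly_rows)
print(daily_rows)
```

A month-long query against the daily copy would then read about 30 cells per object and metric instead of about 720, at the cost of storing and maintaining the extra data.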

It's also possible to maintain indices in both orderings and pick the index based upon the type of query you are performing.

cassandra schema
