Saturday, 15 May 2010

Pentaho - Test multiple regex on each document




I am getting documents from a MongoDB collection (millions of them), and I have a lot of regexes stored in PostgreSQL. I want to test each regex, until one matches, against multiple fields contained in the documents.

Do you have any thoughts on how to do this?

I tried the Filter rows step, but I can't figure out how to loop over the regexes from PostgreSQL.

You can solve this problem using the Join Rows (cartesian product) step. One of the inputs reads in the documents, the other reads in the regular expressions. The join step creates the cartesian product of the two, resulting in every possible combination of regular expression and document. You then feed this stream into a Filter rows step and send the result to the output.

The following transformation mimics this approach (it reads CSV files, but that should not make a difference compared to reading from PostgreSQL or MongoDB):

The "documents" input is configured as follows:

The "regular expressions" input is configured as follows:

The Join Rows step does not have to be configured at all: since no join condition is provided, it performs a full cartesian join.

In the Filter rows step you have to use the doc_text and regex_text fields to perform a check based on the REGEXP operator.

For the document input

doc_id;doc_text
1;dfgbggg
2;uhlljal
3;jjjjhhh
4;fgakkbl

and the regex input

regex_id;regex_text
1;.*a.*
2;.*b.*

the transformation outputs the following result:

doc_id;doc_text;regex_id;regex_text
1;dfgbggg;2;.*b.*
2;uhlljal;1;.*a.*
4;fgakkbl;1;.*a.*
4;fgakkbl;2;.*b.*
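
To make the logic of the transformation easier to follow outside of Kettle, here is a minimal Python sketch of the same idea: build the cartesian product of documents and regexes, then keep only the pairs where the regex matches. It is not Kettle code. The function name `join_and_filter` is made up for illustration, and the use of `fullmatch` assumes that the REGEXP operator requires the whole field to match (which is consistent with the `.*a.*` style patterns in the sample, but is an assumption).

```python
import re
from itertools import product

# Sample data copied from the CSV inputs shown above.
documents = [(1, "dfgbggg"), (2, "uhlljal"), (3, "jjjjhhh"), (4, "fgakkbl")]
regexes = [(1, ".*a.*"), (2, ".*b.*")]


def join_and_filter(documents, regexes):
    """Hypothetical helper: cartesian join followed by a regex filter."""
    # Compile each pattern once instead of once per document.
    compiled = [(rid, text, re.compile(text)) for rid, text in regexes]
    rows = []
    # product() plays the role of the Join Rows (cartesian product) step.
    for (doc_id, doc_text), (regex_id, regex_text, pattern) in product(documents, compiled):
        # Assumption: whole-string matching, like a full-match REGEXP check.
        if pattern.fullmatch(doc_text):
            rows.append((doc_id, doc_text, regex_id, regex_text))
    return rows


if __name__ == "__main__":
    print("doc_id;doc_text;regex_id;regex_text")
    for row in join_and_filter(documents, regexes):
        print(";".join(str(value) for value in row))
```

Note that the cartesian product grows as (number of documents) x (number of regexes), so with millions of documents and many regexes the joined stream can become very large before the filter reduces it.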

pentaho kettle
