Pentaho - Test multiple regexes on each document
I am getting documents from a MongoDB collection (millions of them), and I have a lot of regexes stored in PostgreSQL. I want to test each regex, until one matches, against multiple fields contained in the documents.
Do you have any thoughts on how to do this?
I tried the Filter Rows step, but I can't figure out how to loop over the regexes from PostgreSQL.
You can solve this problem using the Join Rows (cartesian product) step. One of the inputs has to read the documents, the other has to read the regular expressions. The join step creates the cross product of the two, resulting in every possible combination of regular expression and document. Feed this stream into a Filter Rows step and send the result to the output.
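Outside of Kettle, the same idea (cross join, then filter) can be sketched in a few lines of Python. This is only an illustration, not the transformation itself: the function name `cross_join_and_filter` and the tuple layout are made up for the example, and it assumes the regex has to match the whole field value.

```python
import re
from itertools import product


def cross_join_and_filter(docs, regexes):
    """Pair every document with every regex and keep the pairs that match.

    docs    -- iterable of (doc_id, doc_text) tuples (hypothetical layout)
    regexes -- iterable of (regex_id, regex_text) tuples (hypothetical layout)
    """
    # Compile each pattern once instead of once per document.
    compiled = [(rid, pat, re.compile(pat)) for rid, pat in regexes]
    return [
        (doc_id, text, rid, pat)
        # product() yields every (document, regex) combination,
        # which is what the cartesian join produces.
        for (doc_id, text), (rid, pat, rx) in product(docs, compiled)
        # Assumption: the pattern must cover the whole text.
        if rx.fullmatch(text)
    ]
```

Note that the intermediate stream has (number of documents) x (number of regexes) rows, so the filter should follow the join directly.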
The following transformation mimics this approach (it reads CSV files, but that should make no difference compared with reading from PostgreSQL or MongoDB):
The "documents" input is configured as follows:
The "regular expressions" input is configured as follows:
The Join Rows step does not have to be configured at all: since no join condition is provided, it produces the full cartesian product.
In the Filter Rows step, use the doc_text and regex_text fields to run the check with the REGEXP operator.
For the document input

doc_id;doc_text
1;dfgbggg
2;uhlljal
3;jjjjhhh
4;fgakkbl

and the regex input

regex_id;regex_text
1;.*a.*
2;.*b.*

the transformation outputs the following result:

doc_id;doc_text;regex_id;regex_text
1;dfgbggg;2;.*b.*
2;uhlljal;1;.*a.*
4;fgakkbl;1;.*a.*
4;fgakkbl;2;.*b.*
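The transformation keeps every matching combination, so document 4 shows up twice. The question asked to test each regex only until one matches; a minimal Python sketch of that variant on the same sample data could look like the following. The names `DOCS`, `REGEXES` and `first_match_per_doc` are made up for the example, and whole-field matching is assumed.

```python
import re

# Sample data from the post, as (id, text) tuples.
DOCS = [(1, "dfgbggg"), (2, "uhlljal"), (3, "jjjjhhh"), (4, "fgakkbl")]
REGEXES = [(1, ".*a.*"), (2, ".*b.*")]


def first_match_per_doc(docs, regexes):
    """Return {doc_id: regex_id} for the first regex that matches each document."""
    compiled = [(rid, re.compile(pat)) for rid, pat in regexes]
    result = {}
    for doc_id, text in docs:
        for rid, rx in compiled:
            # Assumption: the pattern must cover the whole text.
            if rx.fullmatch(text):
                result[doc_id] = rid
                break  # stop testing further regexes for this document
    return result


print(first_match_per_doc(DOCS, REGEXES))
```

Documents without any match (document 3 here) are simply left out of the result.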