Tuesday, 15 July 2014

regex - Finding words with doubled letters in HTML text with a regexp




How do I write a regular expression that finds words with doubled letters in a document?

By doubled letters I mean: the "s" in "progress", the "d" and the "s" in "address", the "o" in "tool", and so on. I want to match these words within the <body> part of an HTML document.

Below is a bit of code that shows what I am trying to do:

while (<>) {
    if (/<body(.*)>/ .. /<\/body>/) {
        foreach ($_) {
            print $_ =~ /\b\w{0,10}(\w)\1\w{0,10}\b/;
        }
    }
}

This is not an obvious task, first off because parsing HTML with a regex is hazardous. With the usual disclaimers about doing so, here is a regex that does the job:

(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b

see the demo.

In Perl:

@result = $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;

(?s) allows the dot to match newlines.

(?:<body>|\G) matches <body> or the ending position of the previous match.

(?:.(?!</body>))*? lazily matches chars that are not followed by the closing </body> tag.

\K tells the engine to drop what has been matched so far from the returned match.

\b\w*(\w)\1\w*\b matches a word (delimited by the \b boundaries) made of optional chars \w*, one captured char (\w) followed by a back-reference \1 to that captured char, then more optional chars \w*.
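To try the pattern on a whole document, a small self-contained script along the following lines may help. It is only a sketch and departs from the one-liner above in three ways, all of which are my assumptions rather than part of the original answer. First, in list context a //g match with a capture group returns the captured group (here the doubled letter), so the sketch collects $& in a loop to get the whole words. Second, \G also matches at the very start of the string when there has been no previous match, so it is guarded with (?!\A) to keep text before <body> out of the results. Third, the lookahead is placed before the dot, so a match that ends immediately before </body> cannot step over the closing tag. The helper name doubled_words and the sample HTML are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every word between <body> and </body>
# that contains a doubled character.
sub doubled_words {
    my ($subject) = @_;
    my @words;
    # (?:<body>|\G(?!\A))  start at <body>, or where the previous match ended
    #                      (but never at the very start of the string)
    # (?:(?!</body>).)*?   lazily skip chars, never stepping onto </body>
    # \K                   drop everything skipped so far from the match
    while ($subject =~ m%(?s)(?:<body>|\G(?!\A))(?:(?!</body>).)*?\K\b\w*(\w)\1\w*\b%g) {
        push @words, $&;    # $& is the text matched after \K: the whole word
    }
    return @words;
}

# Made-up sample document; "Hello" sits in <head> and should be ignored.
my $html = "<html><head><title>Hello</title></head>\n"
         . "<body><p>Work in progress at this address.\nA good tool.</p></body></html>";

print join(", ", doubled_words($html)), "\n";   # progress, address, good, tool
```

Like the original pattern, this still expects a literal <body> tag without attributes and will happily match doubled letters inside tag or attribute names, which is part of why a real HTML parser is the safer route.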

If you want to allow only letters (no digits or underscores), replace \w with [a-z], and replace (?s) with (?is) to make the match case-insensitive.
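As a rough illustration of that letters-only variant, the sketch below swaps \w for [a-z] and turns on (?is). The function name and sample string are made up for the example. It also keeps the same assumed adjustments (guarded \G, lookahead before the dot, collecting $&) rather than using the exact pattern from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: like the \w version, but only ASCII letters count,
# and matching is case-insensitive.
sub doubled_letter_words {
    my ($subject) = @_;
    my @words;
    while ($subject =~ m%(?is)(?:<body>|\G(?!\A))(?:(?!</body>).)*?\K\b[a-z]*([a-z])\1[a-z]*\b%g) {
        push @words, $&;    # whole word, thanks to \K
    }
    return @words;
}

# Under (?i) the back-reference \1 also matches case-insensitively,
# so "Aa" counts as a doubled letter; "1001" no longer matches.
print join(", ", doubled_letter_words("<BODY>Aardvark, room 1001, tool</BODY>")), "\n";
# Aardvark, room, tool
```

Note that \b is still defined in terms of \w, so a run of letters that touches a digit or an underscore is not treated as a separate word.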

