regex - Finding words with doubled letters in HTML text with a regexp
How do I write a regular expression that finds words with doubled letters in a document?
By doubled letters I mean: "s in progress", "d and s in address", "o in tool", and so on. I want to match these words only within the <body> part of an HTML document.
The bit of code below shows what I am trying to do:
while (<>) {
    if (/<body(.*)>/ .. /<\/body>/) {
        foreach ($_) {
            print $_ =~ /\b\w{0,10}(\w)\1\w{0,10}\b/;
        }
    }
}
This is not an obvious task, first of all because parsing HTML with regex is hazardous. With all the disclaimers about doing so, here's a regex that does the job:
(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b
See the demo.
In Perl:
@result = $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;
(?s) allows the dot to match newlines
(?:<body>|\G) matches <body> or the ending position of the previous match
(?:.(?!</body>))*? lazily matches any chars that are not followed by the closing </body> tag
\K tells the engine to drop what had been matched so far from the returned match
\b\w*(\w)\1\w*\b matches a word (delimited by the \b boundaries) made of optional chars \w*, one captured char (\w) followed by a reference to Group 1 \1, and more optional chars \w*
If you want to allow only letters (no digits and underscores), replace \w with [a-z], and replace (?s) with (?is) to make it case-insensitive.
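For anyone who wants to try this end to end, here is a minimal sketch of how the pattern might be hardened. It is not part of the answer above, and doubled_letter_words is a hypothetical helper name. It assumes Perl 5.10 or later (for \K), lowercase <body> tags, and no literal </body> inside scripts or comments. It deviates from the pattern above in four places:

- <body\b[^>]*> is used so that a body tag with attributes is accepted.
- \G(?!\A) is used because a bare \G also matches at the very start of the string, which would let the first match begin in <head>.
- The lookahead is placed before the dot, so the scan cannot step over </body> when a match ends right in front of it.
- The matches are collected through $&, because a list-context m//g with a capture group returns the captured letters rather than the whole words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns every word
# between <body ...> and </body> that contains a doubled character.
sub doubled_letter_words {
    my ($html) = @_;
    my @words;
    my $re = qr{
        (?: <body\b[^>]*> | \G(?!\A) )  # begin after the body tag, or resume after the previous match
        (?: (?!</body>) . )*?           # advance lazily, never stepping onto the closing tag
        \K                              # drop everything matched so far from the result
        \b \w* (\w) \1 \w* \b           # a word with the same character twice in a row
    }xs;
    # Scalar-context m//g yields one match per iteration; $& is the text
    # matched after \K, that is, the whole word.
    push @words, $& while $html =~ /$re/g;
    return @words;
}

my $html = '<html><head><title>Hello</title></head>'
         . '<body class="x"><p>Work in progress: a tool for my address</p></body></html>';
print join(', ', doubled_letter_words($html)), "\n";
# prints: progress, tool, address
```

"Hello" in the title is skipped because it sits before <body>. Note that the pattern still looks inside tags within the body, so attribute names and values with doubled letters would be reported too.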