regex - Finding words with doubled letters in HTML text with a regexp
How can I write a regular expression that finds words with doubled letters in a document?
By doubled letters I mean: "s" in "progress", "d" and "s" in "address", "o" in "tool", and so on. I want to match these words only within the <body> part of an HTML document.
The bit of code below shows what I am trying to do:
while (<>) {
    if (/<body(.*)>/ .. /<\/body>/) {
        foreach ($_) {
            print $_ =~ /\b\w{0,10}(\w)\1\w{0,10}\b/;
        }
    }
}
This is not an obvious task, first off because parsing HTML with a regex is hazardous. With the usual disclaimers about doing so, here's a regex that does the job:
(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b
See the demo.
In Perl:
# $& holds the whole word; in list context m//g would return only the Group 1 letter.
push @result, $& while $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;
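For anyone who wants to try this end to end, here is a minimal sketch. The sample document and the variable names are made up for illustration. It also adds one thing that is not in the regex above: a (?!\A) guard after \G. Without it, \G also matches at the very start of the string when there is no previous match, so words before <body> (in <head>, for instance) could slip in.

```perl
use strict;
use warnings;

# Hypothetical sample document: only words inside <body> should be reported.
my $subject = <<'HTML';
<html>
<head><title>Hello dummy title</title></head>
<body>
<p>The address of the tool is in progress.</p>
</body>
<!-- trailing coffee -->
</html>
HTML

# Same regex as above, plus (?!\A) so that \G cannot match at offset 0
# before any <body> has been seen (an addition, not part of the original).
my $re = qr%(?s)(?:<body>|\G(?!\A))(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%;

# $& is the text matched after \K, i.e. the whole word;
# $1 would only be the doubled letter.
my @result;
push @result, $& while $subject =~ /$re/g;

print "$_\n" for @result;   # address, tool, progress
```

"Hello" and "dummy" in the <head> and "coffee" after </body> are skipped. Note that tag and attribute names inside the body are still treated as words, which is one of the hazards of doing this with a regex.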
(?s) allows the dot to match newlines.
(?:<body>|\G) matches <body> or the ending position of the previous match.
(?:.(?!</body>))*? lazily matches chars that are not followed by the closing </body> tag.
\K tells the engine to drop what had been matched so far from the returned match.
\b\w*(\w)\1\w*\b matches a word (between the \b boundaries) made of optional chars \w*, one char captured to Group 1 by (\w), followed by the same char again via the back-reference \1, and more optional chars \w*.
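The last two pieces are easier to see in isolation. This is only a sketch on plain text (no HTML involved); the sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Words from the question, plus two without doubled letters.
my $text = "progress address tool cat dog";

my (@words, @letters);
while ($text =~ /\b\w*(\w)\1\w*\b/g) {
    push @words,   $&;   # the whole word
    push @letters, $1;   # the doubled character captured by (\w)
}
print "@words\n";     # progress address tool
print "@letters\n";   # s s o

# Because the leading \w* is greedy, the pair that ends up in $1 is the
# last one in the word: the "ss" of "address", not the "dd".

# \K on its own: "id=" has to be there, but it is left out of $&.
my $kept;
$kept = $& if "id=coffee" =~ /id=\K\w+/;
print "$kept\n";      # coffee
```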
If you want to allow only letters (no digits or underscores), replace \w with [a-z], and replace (?s) with (?is) to make it case-insensitive.
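A quick comparison of the two variants, as a sketch: it applies only the word part of the pattern to a made-up plain-text string, so (?i) stands in for (?is) here.

```perl
use strict;
use warnings;

# Made-up sample: mixed-case doubles, digits and underscores.
my $text = "Aardvark v1__2 x99 Bookkeeper";

# \w version: digits and underscores count, and "Aa" is not a double
# because the match is case-sensitive.
my @with_w;
push @with_w, $& while $text =~ /\b\w*(\w)\1\w*\b/g;

# Letters-only, case-insensitive version: "Aa" now counts as a double,
# while "__" and "99" no longer do.
my @letters_only;
push @letters_only, $& while $text =~ /(?i)\b[a-z]*([a-z])\1[a-z]*\b/g;

print "@with_w\n";         # v1__2 x99 Bookkeeper
print "@letters_only\n";   # Aardvark Bookkeeper
```

Keep in mind that \b is still defined in terms of \w, so a run of letters glued to digits or underscores is not treated as a separate word by the letters-only version.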