regex - Python Extract every sentence that contains Parenthesis
import re

with open(searchfile) as f:
    pattern = r"\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
    for line in f:
        match = re.search(pattern, line)
        if match is not None:
            print(match.group("sentence"))
I am trying to extract every sentence that contains an acronym in parentheses (essentially 2-4 capital letters in parentheses).
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And, this (AVCD) one.
Out: Here is an (ABC) example. Include this (AB) one. And, this (AVCD) one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
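As a minimal sketch of how this pattern behaves, the snippet below applies it with `re.findall`. The sample text is a reconstruction of the question's example, and the character class is written as uppercase `[A-Z]` on the assumption that the acronyms are 2-4 capital letters, as the question describes.

```python
import re

# Sample input, reconstructed for illustration (assumption: acronyms are uppercase).
text = ("Here is an (ABC) example. Do not include this sentence. "
        "Include this (AB) one. And, this (AVCD) one.")

# Non-dot characters, an acronym in parentheses, more non-dot characters, then a dot.
pattern = re.compile(r"[^.]*?\([A-Z]{2,4}\)[^.]*\.")

# Each match keeps the space that follows the previous sentence, so strip it.
sentences = [s.strip() for s in pattern.findall(text)]

for s in sentences:
    print(s)
```

The sentence without an acronym is skipped because `[^.]` cannot cross its closing dot, so every attempt that starts inside that sentence fails.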
But note that this is a particularly inefficient way, since the pattern starts with a permissive subpattern. You can improve it a little by adding a kind of anchor at the beginning:
(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunately, even with this anchor, the regex engine must still check the two alternatives at each character of the string.
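The following sketch checks that the anchored pattern returns the same matches as the unanchored one on the sample text. It assumes the lookbehind is meant to test for a literal dot (written `\.`) and that the acronyms are uppercase; the sample text is a reconstruction of the question's example.

```python
import re

# Sample input, reconstructed for illustration.
text = ("Here is an (ABC) example. Do not include this sentence. "
        "Include this (AB) one. And, this (AVCD) one.")

plain = re.compile(r"[^.]*?\([A-Z]{2,4}\)[^.]*\.")

# Assumption: the anchor only allows a match to start at the beginning of the
# string or right after a literal dot, hence the escaped \. in the lookbehind.
anchored = re.compile(r"(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.")

plain_matches = plain.findall(text)
anchored_matches = anchored.findall(text)

# The anchor changes where the engine is allowed to start, not what it finds.
print(plain_matches == anchored_matches)
for s in anchored_matches:
    print(s.strip())
```

With the anchor, a start position in the middle of a sentence is rejected after the two cheap checks instead of scanning ahead to the next dot.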
A better approach might be to find the substrings that start at an acronym and run until the end of the sentence, along with the dots, and then extract the substrings using the end offset of each result:
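#!/usr/bin/python
import re

txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And, this (AVCD) one.'

# Group 1: a sentence terminator followed by whitespace.
# Otherwise: an acronym followed by everything up to the next dot or the end.
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')

offset = 0   # end of the previous match, i.e. start of the current sentence
result = ''

for m in pattern.finditer(txt):
    if m.group(1) is None:
        # Acronym branch matched: keep the sentence from its start to this end.
        result += txt[offset:m.end()]
    offset = m.end()

print(result)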
Note: you can't be sure that a dot stands for the end of a sentence; it can be something else.
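To illustrate this caveat, here is a small sketch with a made-up sentence (not from the question) in which a dot belongs to an abbreviation. The simple pattern, assumed here with an uppercase `[A-Z]` class, treats that dot as a sentence boundary and drops the text before it.

```python
import re

# Hypothetical input: the dot in "Dr." does not end the sentence.
text = "Dr. Smith joined the (WHO) in 2010."

pattern = re.compile(r"[^.]*?\([A-Z]{2,4}\)[^.]*\.")

found = [s.strip() for s in pattern.findall(text)]

# The match can only start after the dot of "Dr.", so the title is lost.
print(found)
```

Handling such cases would need a real sentence splitter or a list of known abbreviations, which is beyond what a single regex over dots can decide.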
regex file-io