Sunday, 15 May 2011

python - Accessing a value in a defaultdict and stripping out the URL portion of it




I have a big defaultdict that has a dict within a dict, with the inner dict containing an HTML email body. I want to return only the http string from within the inner dict. What's the best way to go about extracting that?

Do I need to convert the dict to another data structure before using a regex? Is there a better way? I'm still new to Python and would appreciate any pointers.

For example, I'm working with:

defaultdict(<type 'dict'>, {16: {u'seq': 16, u'rfc822': u'delivered-to: somebody@email.com lots more html until http://the_url_i_want_to_extract.com' }})

One thing I've tried is using re.findall on the defaultdict, which didn't work:

confirmation_link = re.findall('click link confirm registration:<br />" (.*?)"', body)
for conf in confirmation_link:
    print conf

Error:

line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

You can only use a regular expression on a string, so iterate over the dictionary first and apply the regex to the corresponding value:

import re
from collections import defaultdict

d = defaultdict(dict, {16: {u'seq': 16, u'rfc822': u'delivered-to: somebody@email.com lots more html until http://the_url_i_want_to_extract.com'}})

for k, v in d.iteritems():
    # v is the inner dictionary that contains the HTML string:
    str_with_html = v['rfc822']
    # This regular expression starts matching at "http" and
    # continues until a whitespace character is hit.
    match = re.search(r"http[^\s]+", str_with_html)
    if match:
        print match.group(0)

Output:

http://the_url_i_want_to_extract.com
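
For readers on Python 3, here is a minimal sketch of the same idea: pull the string out of each inner dict first, then hand only that string to the regex. The helper name `extract_urls`, its `field` parameter, and the `URL_PATTERN` constant are made up for illustration and are not part of the original answer. The sample data is the dictionary from the question.

```python
import re
from collections import defaultdict

# Matches "http://" or "https://" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(messages, field="rfc822"):
    """Return {key: [urls, ...]} for every inner dict whose `field` is a string."""
    found = {}
    for key, inner in messages.items():
        body = inner.get(field)
        # The re functions need a string, not a dict, so skip anything else.
        if isinstance(body, str):
            found[key] = URL_PATTERN.findall(body)
    return found


d = defaultdict(dict)
d[16] = {
    "seq": 16,
    "rfc822": "delivered-to: somebody@email.com lots more html until "
              "http://the_url_i_want_to_extract.com",
}

urls = extract_urls(d)
print(urls[16][0])  # http://the_url_i_want_to_extract.com
```

Because the pattern stops only at whitespace, a URL sitting inside an HTML attribute would keep its trailing quote or angle bracket. For real email bodies it is probably worth tightening the character class or using an HTML parser instead.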

python regex dictionary defaultdict
