Thursday, 15 August 2013

Python - UnicodeDecodeError: Converting type string to unicode



I am trying to replace some text. Unfortunately, the main string is stored as type unicode, but the string that describes the text to be replaced is stored as type str. Below is a reproducible example:

mystring = u'bunch of text non-standard character in name rubén'
old = 'rubén'
new = u'newtext'
mystring.replace(old, new)

This throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)

I get the same error when I try to convert old to unicode with unicode(old). Several answers solve the problem for specific characters, but I cannot find a generic solution.

You need to convert the old value to unicode with an explicit codec. Which codec to use depends exclusively on how you sourced old.

If it is a string literal in the source code, use the source code encoding. Python 2 won't accept a source file containing non-ASCII bytes unless you specified a valid codec at the top in a comment; see PEP 263.

Pasting the old definition into your terminal will use your terminal codec (the terminal sends Python encoded bytes when you paste).

If the data is sourced from anywhere else, you'll need to determine the encoding from that source. For HTTP data, for example, check the Content-Type header for a charset parameter.
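For the HTTP case, a minimal sketch of pulling the charset out of a Content-Type header value might look like this. It uses the standard library's email.message.Message to parse the header; the function names charset_from_content_type and decode_body, and the UTF-8 fallback, are assumptions made for illustration and not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns the charset lowercased, or the
    # fallback object when the header has no charset parameter.
    return msg.get_content_charset(default)


def decode_body(body_bytes, header_value, fallback='utf-8'):
    """Decode a response body using the charset named in the header.

    The fallback codec is an assumption; pick whatever default is
    appropriate for the data source.
    """
    encoding = charset_from_content_type(header_value, fallback)
    return body_bytes.decode(encoding)
```

For instance, decode_body(b'rub\xe9n', 'text/plain; charset=ISO-8859-1') decodes the Latin-1 bytes into the unicode value u'rub\xe9n'.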

Then decode:

old = old.decode(encoding)
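The decode step can be folded into a small helper that normalizes the search string before calling replace(). This is only a sketch: the name replace_text is hypothetical, and the caller still has to supply the correct encoding for the bytes. Escapes are used instead of literal non-ASCII characters so the snippet does not depend on the source file encoding.

```python
def replace_text(text, old, new, encoding):
    """Replace old with new in a unicode text.

    If old arrives as a bytestring, decode it with the given codec
    first so that no implicit ASCII decoding takes place.
    """
    if isinstance(old, bytes):  # bytes is an alias for str on Python 2
        old = old.decode(encoding)
    return text.replace(old, new)


mystring = u'bunch of text non-standard character in name rub\xe9n'
old = b'rub\xc3\xa9n'  # UTF-8 encoded bytes, as a UTF-8 terminal would send
result = replace_text(mystring, old, u'newtext', 'utf-8')
```

Here result is u'bunch of text non-standard character in name newtext'.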

When you use unicode(old) without an explicit codec, or try to use a bytestring in unicode.replace(), Python uses the default codec, ASCII.
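The implicit conversion behaves like an explicit decode with the ASCII codec, which is why it fails on any non-ASCII byte. The following sketch reproduces the failure explicitly; the variable names are illustrative only.

```python
old = b'rub\xc3\xa9n'  # UTF-8 encoded bytes for u'rub\xe9n'

try:
    # Roughly what Python 2 attempts implicitly in unicode(old)
    # or u'...'.replace(old, ...).
    old.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first byte outside the ASCII range is 0xc3, at index 3.
failed_position = error.start
```

Passing the real codec, as in old.decode('utf-8'), avoids the error.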

Demo in my terminal, configured to use UTF-8:

>>> import sys
>>> sys.stdin.encoding  # reflects the detected terminal codec
'UTF-8'
>>> old = 'rubén'
>>> old  # shows the encoded data in Python string literal form
'rub\xc3\xa9n'
>>> old.decode('utf8')  # unicode string literal form
u'rub\xe9n'
>>> print old.decode('utf8')  # string value written to the terminal
rubén
>>> mystring = u'bunch of text non-standard character in name rubén'
>>> new = u'newtext'
>>> mystring.replace(old, new)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> mystring.replace(old.decode('utf8'), new)
u'bunch of text non-standard character in name newtext'

Generally speaking, you want to decode early and encode late; build your data flow as a Unicode sandwich. When you receive text, decode it to unicode values, and don't encode again until the data is leaving your program.
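A minimal sketch of that sandwich, assuming the input and output encodings are known: bytes are decoded at the boundary, all processing happens on unicode values, and encoding happens only on the way out. The function names process and run are hypothetical.

```python
def process(text):
    """Work purely on unicode values in the middle of the sandwich."""
    return text.replace(u'rub\xe9n', u'newtext')


def run(raw_in, in_encoding='utf-8', out_encoding='utf-8'):
    """Decode early, process as unicode, encode late."""
    text = raw_in.decode(in_encoding)    # bottom slice: bytes -> unicode
    result = process(text)               # filling: unicode only
    return result.encode(out_encoding)   # top slice: unicode -> bytes


output = run(b'name rub\xc3\xa9n')
```

With UTF-8 input as above, output is the bytestring b'name newtext'.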

python python-2.7 unicode
