Briefly: Python has pretty good Unicode support, but it implicitly converts Unicode strings to byte strings whenever it thinks it's necessary (such as when it hits a "print" statement). The implicit conversion uses the "default encoding", which is 'ascii' out of the box and fairly difficult to change from inside a running program. It also uses 'strict' error handling by default, which means it raises an error when it hits a character it can't convert.
The effect of this is that when you're messing around with something that contains anything outside 7-bit ASCII, everything suddenly grinds to a halt with a UnicodeEncodeError when you hit a print statement, even if you're using a Unicode-aware environment. Changing the default encoding is pretty straightforward on a site-wide basis, but that's kind of scary, since it changes the behavior of every program that runs under that Python installation.
More information about this is available at http://diveintopython.org/xml_processing/unicode.html
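To make the failure concrete, here is a minimal sketch of what 'strict' conversion to ASCII does with a single accented character, next to the more forgiving 'replace' handler. The helper name try_ascii is made up for this illustration; only the standard encode() method is used.

```python
text = u"caf\xe9"  # "cafe" with an accented e, which is outside 7-bit ASCII


def try_ascii(value, errors="strict"):
    """Encode value as ASCII; return the bytes, or the exception if it fails.

    try_ascii is a hypothetical helper for this example, not a library call.
    """
    try:
        return value.encode("ascii", errors)
    except UnicodeEncodeError as exc:
        return exc


# 'strict' (the default) refuses to guess and raises UnicodeEncodeError.
strict_result = try_ascii(text)

# 'replace' swaps each unencodable character for "?" instead of raising.
replace_result = try_ascii(text, "replace")
```

The implicit conversion described above behaves like the 'strict' call, which is why one stray character is enough to stop a script.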
My solution:
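# sitecustomize.py
# This file can be anywhere on your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
# Python 2 only: site.py removes sys.setdefaultencoding() once startup
# is finished, so the call has to happen here.
import sys
import locale

# Use the locale's preferred encoding as the default, if one is reported.
encoding = locale.getpreferredencoding()
if encoding:
    sys.setdefaultencoding(encoding)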
You might observe that I am trusting the user to have configured their locale to report the correct encoding. This seems reasonable. Is there a good reason that Python, even though it (correctly) detects an encoding for stdout, can't seem to decide to use that encoding when coercing a Unicode string to a byte string for the print statement? (Note: print goes to stdout by default.) How hard would it be to grab the encoding off the target the output is going to and USE IT?
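For what it's worth, "grab the encoding off the target" can be done by hand without touching the default encoding at all. The following is only a sketch under a few assumptions: encode_for and FakeStream are names invented for this example, the stream is assumed to expose an encoding attribute the way sys.stdout does, and the locale's preferred encoding is used as the fallback when that attribute is missing or empty.

```python
import locale


def encode_for(stream, text, errors="replace"):
    """Encode text using the stream's own encoding (hypothetical helper).

    Falls back to the locale's preferred encoding, then to ASCII, when the
    stream does not report an encoding.
    """
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        encoding = locale.getpreferredencoding() or "ascii"
    return text.encode(encoding, errors)


class FakeStream(object):
    """Stand-in for a file-like object with an 'encoding' attribute."""

    def __init__(self, encoding):
        self.encoding = encoding


# A UTF-8 target gets the character encoded properly...
utf8_bytes = encode_for(FakeStream("utf-8"), u"caf\xe9")

# ...while an ASCII-only target gets "?" rather than an exception.
ascii_bytes = encode_for(FakeStream("ascii"), u"caf\xe9")
```

Under Python 2 this would presumably be used as sys.stdout.write(encode_for(sys.stdout, text)). The 'replace' error handler is a choice made for this sketch: it trades exact output for never crashing on output.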
(Note: I'm using IPython, which generally rocks. From what I'm seeing out there, it doesn't seem to be at all relevant to my problem. But if anybody knows a more elegant way to deal with this, or *why* Python/IPython doesn't use the locale information to figure out what to do when it needs to print a Unicode string, I'd love to know.)