
Friday, July 13, 2007

Parsing Japanese addresses

Last night Steven Bird, Ewan Klein, and Edward Loper gave a presentation about their Natural Language Toolkit at the monthly baypiggies meeting. The gist of the presentation seemed to be that their toolkit is just that: a set of basic tools commonly needed in implementing more complicated natural language processing algorithms and a set of corpora for training and benchmarking those algorithms. Given their background as academics, this makes sense as it allows them to quickly prototype and explore new algorithms as part of their research. However, I got the impression that a number of the attendees were hoping for more of a plug-and-play complete natural language processing solution they could integrate into other programs without needing to be versed in the latest research themselves.

When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found the following python snippet does a pretty good job on the datasets I've been given so far:
    import re

    # Beware of greedy matching in the following regex lest it
    # fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
    # as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1)
    # In addition, we have to handle 京都府 specially since its
    # name contains 都 even though it is a 府.
    _address_re = re.compile(
        ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
        re.UNICODE)

    def splitJapaneseAddress(addrstr):
        """Splits a string containing a Japanese address into
        a tuple containing the prefecture (a.k.a. province),
        county, city, ward, and everything else.
        """
        m = _address_re.match(addrstr.strip())
        (province, county, city, ward, address) = m.groups()
        address = address.strip()
        # 東京都 is both a city and a prefecture.
        if province == u'東京都' and city is None:
            city = province
        return (province, county, city, ward, address)

I should add that, unlike with English addresses, it does not make sense to separate out and store the Japanese street address as its own value, since the full address string is commonly what is displayed. So even though the routine above returns the street address as the final tuple item, I never actually use that value for anything.

Anyway, as you can see this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for doing just this kind of experimentation. I'm hoping that next time I'm given a large dataset for import into our database at work I can justify the time to spend applying NLTK to the task.

Tuesday, April 24, 2007

Python: Printing unicode

By default, the stdout stream in python is assumed to have ascii encoding. While this is the only safe assumption, it gets mighty annoying when your terminal supports utf8 or Microsoft's eponymous mbcs encoding (e.g. pyDev for Eclipse), especially when you are working with unicode data that you would like to print out while debugging.
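The error itself comes from the implicit coercion Python 2 applies when a unicode string hits a byte stream that assumes ascii. Doing that conversion explicitly makes the failure easy to reproduce (a minimal illustration, not tied to any particular terminal):

```python
# -*- coding: utf-8 -*-
# Printing a unicode string to an ascii-encoded stdout boils down to
# this encode call, which fails for any non-ascii characters.
try:
    u'日本語'.encode('ascii')
except UnicodeEncodeError as exc:
    print('print would die with: %s' % exc)
```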

It seems like this is a common problem. In fact, it comes up numerous times at work, and I met a gentleman at the last BayPiggies meeting who was looking for a solution himself. It doesn't help that sys.setdefaultencoding() is a red herring that seems to throw everyone off track.

Enough with the introduction; here is the snippet I use to switch stdout to a non-ascii encoding:
    import codecs, sys
    sys.stdout = codecs.getwriter('mbcs')(sys.stdout)
Of course, change 'mbcs' to 'utf8' or whatever encoding you need. You can get fancy and look up the appropriate encoding based on the terminal environment (actually, 'mbcs' does this for you on Windows), but if you're just looking to print unicode for testing/debugging, this short snippet gets you to the goal in two lines of code.