When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found that the following Python snippet does a pretty good job on the datasets I've been presented with so far:
import re

# Beware of greedy matching in the following regex lest it
# fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
# as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1)
# In addition, we have to handle 京都府 specially since its
# name contains 都 even though it is a 府.
_address_re = re.compile(
    ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
    re.UNICODE)

def splitJapaneseAddress(addrstr):
    """Splits a string containing a Japanese address into
    a tuple containing the prefecture (a.k.a. province),
    county, city, ward, and everything else.
    """
    m = _address_re.match(addrstr.strip())
    (province, county, city, ward, address) = m.groups()
    address = address.strip()
    # 東京都 is both a city and a prefecture.
    if province == u'東京都' and city is None:
        city = province
    return (province, county, city, ward, address)
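As a quick sanity check, the example address from the comment above the regex splits the way that comment describes:

assert splitJapaneseAddress(u'宮城県仙台市泉区市名坂字東裏97-1') == \
    (u'宮城県', None, u'仙台市', u'泉区', u'市名坂字東裏97-1')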
I should add that, unlike with English addresses, it does not make sense to separate and store the Japanese street address as its own value, since the full address string is what is commonly displayed. So even though the routine above returns the street address as the final tuple item, I never actually use that returned value for anything.
Anyway, as you can see, this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for doing just this kind of experimentation. I'm hoping that next time I'm given a large dataset for import into our database at work, I can justify spending the time to apply NLTK to the task.
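If I ever get around to it, I picture the corpus as hand-labeled chunks that get exploded into per-character tags, something roughly like the sketch below. To be clear, the chunk labels (PREF/CITY/WARD/REST), the second training address, and the choice of NLTK's HMM tagger are all just placeholders here to show the shape of the experiment, not a worked-out design:

from nltk.tag.hmm import HiddenMarkovModelTrainer

def to_tagged_chars(chunks):
    # Explode (text, label) chunks into per-character (char, tag) pairs.
    return [(ch, label) for text, label in chunks for ch in text]

# A toy hand-labeled corpus; a real one would need many more addresses.
train = [
    to_tagged_chars([(u'宮城県', 'PREF'), (u'仙台市', 'CITY'),
                     (u'泉区', 'WARD'), (u'市名坂字東裏97-1', 'REST')]),
    to_tagged_chars([(u'京都府', 'PREF'), (u'京都市', 'CITY'),
                     (u'左京区', 'WARD'), (u'吉田本町36-1', 'REST')]),
]

# Train a supervised HMM tagger over the character sequences and try it
# on one of the addresses, tagging it one character at a time.
tagger = HiddenMarkovModelTrainer().train_supervised(train)
print tagger.tag(list(u'宮城県仙台市泉区市名坂字東裏97-1'))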