Monday, April 2, 2007

Python: HTTP Accept-Language header parsing

The more I dig through the code, the more Paste is growing on me.

However, I noticed a few nights ago that Paste's Accept-Language header parsing is subtly non-RFC-compliant (sorry JJ!). The issue is that the regular expression it uses (from paste.httpheaders class _AcceptLanguage):
languageRegEx = re.compile(r"^[a-z]{2}(-[a-z]{2})?$", re.I)
does not match all language tags defined by section 3.10 of RFC 2616. Admittedly, it matches all language tags in common usage, but it does not comply with the letter of RFC 2616 because it rejects tags such as en-cockney, i-cherokee, or x-pig-latin. The RFC says:
any two-letter primary-tag is an ISO-639 language abbreviation
and any two-letter initial subtag is an ISO-3166 country code.

Note that it does not mandate that all primary-tags and subtags be two letters long, nor does it restrict the number of subtags to zero or one. It just says that if they are two letters long, they have the meanings cited. In fact, the augmented BNF grammar (1*8ALPHA) only says that the primary-tag is one to eight alphabetic characters, followed by zero or more subtags, each also consisting of one to eight alphabetic characters.
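To make the difference concrete, here is a rough sketch that transcribes the section 3.10 grammar (primary-tag and subtag are both 1*8ALPHA, i.e. one to eight letters) into a regular expression and compares it with the pattern quoted above. The names paste_regex and rfc2616_language_tag are made up for this illustration and are not part of Paste.

```python
import re

# The pattern quoted from paste.httpheaders above.
paste_regex = re.compile(r"^[a-z]{2}(-[a-z]{2})?$", re.I)

# Hypothetical transcription of the RFC 2616 section 3.10 grammar:
#   language-tag = primary-tag *( "-" subtag )
#   primary-tag  = 1*8ALPHA
#   subtag       = 1*8ALPHA
rfc2616_language_tag = re.compile(r"^[a-z]{1,8}(-[a-z]{1,8})*$", re.I)

# Print each tag with the verdict of the old and the new pattern.
for tag in ("en", "en-US", "en-cockney", "i-cherokee", "x-pig-latin"):
    print(tag, bool(paste_regex.match(tag)), bool(rfc2616_language_tag.match(tag)))
```

Both patterns accept en and en-US; only the second accepts the three longer tags.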

Luckily, the problem regex isn't integral to the parsing algorithm and can be safely removed. As such, all that appears to be necessary to bring the code into RFC 2616 compliance is to remove the definition of languageRegEx as well as the following two lines:
if not self.languageRegEx.match(lang):
    continue
With that obscure bug fixed, you are now ready to start serving Upper Sorbian, Cockney, or even Klingon localized versions of your Paste-powered web site.
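For readers who want to see the idea outside of Paste, here is a minimal standalone sketch of Accept-Language parsing that applies no two-letter filter. It is not Paste's actual implementation; parse_accept_language is a hypothetical helper written for this post, and it assumes a missing q value means 1.0, an unparsable one means 0, and ties keep header order.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best q first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = item.split(";")
        lang = parts[0].strip()
        if not lang:
            continue  # skip empty list elements such as ", ,"
        q = 1.0  # assumption: no q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not acceptable"
        if q > 0:
            # Negate q so that an ascending sort puts the highest q first;
            # position breaks ties in favor of earlier entries.
            ranked.append((-q, position, lang))
    return [lang for _, _, lang in sorted(ranked)]


# hsb is Upper Sorbian and tlh is Klingon; neither is a two-letter tag.
print(parse_accept_language("hsb, en-cockney;q=0.8, tlh;q=0.9, de;q=0"))
```

The three-letter and long-subtag entries come through in preference order, while the q=0 entry is dropped.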

(Now if we could only get the httpheaders code into the Python standard library so everyone can get the benefit of bug-free parsing, whether they use Paste or not.)

1 comment:

jjinux said...

Thanks for illustrating my ignorance ;) Make sure to submit a patch to the Paste guys (i.e. Ian Bicking)!