Thursday, July 1, 2010

Serving file downloads with non-ASCII filenames

Recently, while helping out one of my coworkers, it came to my attention that there is no universally-agreed on way to download a file to a web browser while suggesting a filename that contains non-ASCII characters.

The common way to tell a browser to download a file (rather than try to display it in-browser) is to include a Content-Disposition header in the HTTP response; the header's value should be "attachment". Additionally, the server can include a filename parameter in the Content-Disposition header as a suggestion to the browser for what filename to save the file as.

As a bit of history, the Content-Disposition header was originally defined in RFC 1806 which was obsoleted and replaced by RFC 2183. However, the Content-Disposition header was originally defined for use in MIME messages and, while RFC 2616 (HTTP 1.1) makes reference to the Content-Disposition header, it does so only to note that:
Content-Disposition is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementors.

Luckily, while not officially standardized for use in HTTP, the Content-Disposition header is "widely implemented" indeed; it seems that all modern browsers implement the header. If the web server responds to a request with application/octet-stream data and a Content-Disposition header of "attachment", your browser will display the familiar "Save As..." dialog. If the server included a filename parameter in that Content-Disposition header, your browser will likely pre-fill the filename input field of the "Save As..." dialog with the specified filename.

But here is where things start getting murky.

RFC 2183, skirts the issue of international filenames by disclaiming responsibility:
Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII. We recognize the great desirability of allowing arbitrary character sets in filenames, but it is beyond the scope of this document to define the necessary mechanisms.

So, as long as the downloaded files' names are always representable in the ASCII charactet set, any browser should properly display the filename (although I've seen rumors that some browsers, such as IE, do enforce a limit on the length of the filename). However, I work at a Japanese company, making products largely for the Japanese market, so we don't have the privilege of assuming the whole world is ASCII.

By the way, in case you are curious, even iso-8869-1 (latin1) isn't consistently supported across browsers so Europeans are left high-and-dry too.

You are probably thinking, like I was, that surely this is a solved problem. And actually, it is. Kind of. The Content-Disposition header originates with the MIME protocol which, since the publication of RFC 2231 in 1997, now supports non-ASCII character encodings for header values. So, for example, the filename "foo-ä.html" can be represented in the Content-Disposition header like so:
Content-Disposition: attachment; filename*=UTF-8''foo-%c3%a4.html

The problem is that few browsers actually implement this RFC 2231 syntax. For example, Firefox 3.6 and Opera 10 appear to support the RFC 2231 syntax. On the other hand, for Internet Explorer, Microsoft's developers choose to simply perform URL-style percent-decoding and then interpret the result as bytes of UTF8-encoded characters. So a server would need to send the Content-Disposition header as
Content-Disposition: attachment; filename="foo-%c3%a4.html"
for an MSIE user to see "foo-ä.html" in the "Save As..." dialog.

Despite requests for IETF working group members to fix it, Google's Chrome browser also does not comply with RFC 2231, preferring to follow Microsoft's lead and use simple URL-style percent decoding.

As a result, there is no consistent cross-browser way to suggest a non-ASCII filename for a file download. I'm sure it doesn't help that the Content-Disposition header has never formally been part of the HTTP specification, but yet it is used by all major browsers to implement file download functionality.

Julian Reschke has compiled a test suite and publishes a nifty page illustrating all of incompatibilities between browsers regarding handling of the Content-Disposition header. In addition, as part of the IETF Network Working Group, he is working on an RFC to formally define the interpretation of the Content-Disposition header in the HTTP context.

Unfortunately, because the ambiguity has been left unresolved for so long, some web servers have adopted the MSIE/Chrome encoding technique for their non-ASCII filenames. Actually, my gut feeling is that probably most have, although I don't have any hard numbers to back up that claim. The good news is that since the MSIE/Chrome encoding is only used for parameters in the form filename="..." while the RFC 2231-style encoding used by Firefox, Opera, and Julian's proposal uses filename*=... it is possible for the two to coexist in the same Content-Disposition header (note the presence of the * in the RFC 2231 format to differentiate it).

In fact, probably the most important section of Julian's proposal is section 4.2 where defines the HTTP client's behavior when the server responds with both filename=... and filename*=..., allowing for an easy upgrade path for MSIE and Chrome.

For now, however, Julian's test results show that when presented both traditional and extended formats, only Firefox and Opera will select the extended filename*=... format.

This opens an opportunity for those of us that need to serve file downloads containing non-ASCII filenames: we can include the filename in non-standard encoding supported by MSIE and Chrome first in the Content-Disposition header, followed by the filename in extended RFC 2231 encoding. According to Julian's tests, MSIE and Chrome will always take the first parameter while Firefox and Opera will properly selecte the extended-syntax parameter, no matter what order it appears in.

For example, if the server includes the header:
Content-Disposition: attachment; filename="foo-%c3%a4.html" filename*=UTF-8''foo-%c3%a4.html
all four major browsers should properly display the filename "foo-ä.html" in the "Save As..." dialog. Unfortunately, WebKit-based browsers, like Apple's Safari browser, would display the raw percent-encoded value "foo-%c3%a4.html" as the filename. At least for now, though, I'm afraid this is the best we can do.


Bart said...

Great post, which I only just stumbled upon, because I used Apple's Safari to download/export a spreadsheet from Google Docs. Space is encoded as %20...

Matthew Schinckel said...

If you don't supply a filename, then the filename of the file that is being served gets used: that seems to retain the correct unicode characters, at least in my testing.