Friday, July 16, 2010

HTML encoding of form inputs

I suppose this is common knowledge amongst professional web developers but I just discovered myself that if a user enters characters into a HTML form input that is not representable in the character set of the page the form is in, browsers will HTML-encode the non-representable characters when the form is submitted. I just spent over an hour assisting a coworker to track down a bug in one of our web applications that was due to this poorly-documented -- but reasonable -- behavior.

I say "reasonable" because, as obscure as it is, this is really the best thing I think a browser can do given the situation.

To recap, here is the scenario:
  • You have a web page with a form in it that is served using some locale-specific encoding. In our case it was Shift-JIS, but the default ISO8859-1 encoding leads to the same problem.
  • The user enters text into a form input field that is not representable in the displayed page's character set or encoding. For example, entering Cyrillic characters into a form displayed on an ISO8859-1 page.
  • When the user submits the form, the browser tries to convert the inputs to the encoding of the page. Any character not representable in that page's character set or encoding has its Unicode character code point encoded as an HTML numeric character reference (e.g. DŽ).
  • The web application or CGI receiving this input needs to a) know the character encoding of the page that was used to submit the form data so it knows how to interpret the data as characters and b) be prepared to convert any embedded HTML numeric character references back to their corresponding characters.

I like that last part where web applications (or CGIs) have to know the encoding of an HTML page served to the client in order to be able to properly parse input from that client. This fact shatters any remaining fantasies I had of HTTP being stateless.

Anyway, the real surprise is that a web application or CGI needs to be prepared to unencode HTML entities in form input. I quick check of perl's CGI.pm and python's cgi module indicates that neither of them do entity decoding of inputs automatically. And considering that information on the web regarding this behavior is sparse , I suspect that most web developers are unaware of it. At the time of writing, I can only find two references [1][2] that document HTML character reference encoding in the scenario described above.

Luckily, there is a really simple solution: always serve pages in UTF8 encoding and always expect form input to be in UTF8 encoding. One of the many great things about UTF8 encoding is that all characters are representable, so you never have to worry about the browser resorting to HTML character reference encoding.

No comments: