Monday, July 23, 2007

Python: Serializer benchmarks

I am working on a project in which clients will be submitting more data than my current server implementation knows what to do with. The current implementation doesn't use all of the submitted data because I won't know the quality of the data until the client is deployed in the wild. I want to record all of the submitted data, though, in the expectation that a future implementation will be able to use it. So I was considering formats for logging the submitted data such that it would be easy to parse in the future.

Since I'm already storing a subset of the submitted data in a database, the most obvious solution is to make a table of submissions which has a column for each submitted data element. However, it turns out that this is quite slow and given that I'm not sure how much of the extra data I'll ever need or when I may update the server implementation to use it, I hate to pay a hefty price to store it now. For now, I can consider the data write-only. If and when I need to use that data, then I can write an import script that updates the server database using the saved data.

So I've been considering simply logging the submissions to a file. It is considerably faster to append to a flat file than it is to write to a database -- which makes sense since the database supports read/write access, whereas I only need write-only access for now.

The next question is what format to use when writing the data to the log file. I have a python dictionary of the submitted data; at first I considered writing the dictionary to the log file in JSON format. The JSON format is relatively easy to convert to/from python data structures and python has quality implementations to do it. Furthermore, unlike the pickle text format, it is trivial to visually interpret the serialized data. This latter point is also important to me since I need to be able to judge the quality of the data in order to discern what portions I can use in the future.
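As a minimal sketch of the flat-file idea, the snippet below appends each submission to a log as one JSON document per line and reads the log back, roughly as a later import script might. It is written for a current Python 3 and uses the standard-library json module rather than simplejson; the function names log_submission and read_submissions, the file name, and the one-record-per-line layout are all assumptions for illustration, not my actual server code.

```python
import json
import os
import tempfile


def log_submission(path, submission):
    """Append one submission dict to the log as a single JSON line."""
    with open(path, "a", encoding="utf-8") as log:
        # sort_keys keeps the lines easy to compare by eye.
        log.write(json.dumps(submission, sort_keys=True) + "\n")


def read_submissions(path):
    """Yield each logged submission back as a dict."""
    with open(path, encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical log location and sample records.
    path = os.path.join(tempfile.mkdtemp(), "submissions.log")
    log_submission(path, {"client": "example", "value": 42})
    log_submission(path, {"client": "other", "value": 7})
    for record in read_submissions(path):
        print(record)
```

Because every record sits on its own line, the file can be appended to cheaply and still be scanned by eye or with ordinary text tools.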

However, to my chagrin, it turns out that the JSON module I have been using, simplejson, is slower than I had imagined. Profiling of my server implementation found that, after the database update logic, serializing the submitted data into JSON format was my second largest consumer of CPU cycles. I hate the thought of wasting so much time logging the data when it is an operation that is essentially pure overhead.

Hence I started considering other serialization formats, benchmarking them as I went. Here are the results of my benchmarks:


Serializer          Run 1 (secs)  Run 2 (secs)   Mean (secs)
pyYAML 3.05             21953.18      25482.61      23717.89
pySyck 0.61.2            3107.06       2805.38       2956.22
pprint                   2364.91       2368.42       2366.67
pickle                   1509.31       1665.16       1587.23
pickle/protocol=2        1359.40       1330.71       1345.05
simplejson 1.7.1          710.78        604.13        657.46
cPickle                   159.27        172.26        165.77
repr                       73.50         77.24         75.37
cjson 1.0.3                63.94         74.28         69.11
cPickle/protocol=2         50.97         57.72         54.34
marshal                    12.52         13.32         12.92

All numbers were obtained using the timeit module to serialize the dictionary created by the expression "dict([ (str(n), n) for n in range(100) ])".
The tests were run under Python 2.5 (r25:51908, Mar 3 2007, 15:40:46) built using [GCC 3.4.6 [FreeBSD] 20060305] on freebsd6. The simplejson, cjson, pyYAML, and pySyck modules were installed from their respective FreeBSD ports (I had to update the FreeBSD pySyck port to install 0.61.2 since it otherwise installs 0.55).
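A benchmark along these lines can be sketched as follows. This is not the exact script behind the numbers above: it targets a current Python 3, covers only standard-library serializers (the json module stands in for simplejson, and cjson, cPickle, pyYAML, and pySyck are left out), and the iteration count and the benchmark() helper are assumptions of mine. Only the test dictionary expression and the use of timeit come from the description above.

```python
import json
import marshal
import pickle
import pprint
import timeit

# The test dictionary from the post: keys "0".."99" mapped to their ints.
DATA = dict([(str(n), n) for n in range(100)])

# Standard-library serializers only; third-party modules would be added here.
SERIALIZERS = {
    "pprint": pprint.pformat,
    "pickle": pickle.dumps,
    "pickle/protocol=2": lambda obj: pickle.dumps(obj, 2),
    "json": json.dumps,
    "repr": repr,
    "marshal": marshal.dumps,
}


def benchmark(serializers, data, number=1000, runs=2):
    """Return {name: [seconds per run]}, each run timing `number` calls."""
    results = {}
    for name, func in serializers.items():
        timer = timeit.Timer(lambda: func(data))
        results[name] = timer.repeat(repeat=runs, number=number)
    return results


if __name__ == "__main__":
    results = benchmark(SERIALIZERS, DATA)
    # Slowest first, like the table above.
    ordered = sorted(results.items(), key=lambda kv: -sum(kv[1]))
    for name, times in ordered:
        mean = sum(times) / len(times)
        print("%-20s %s  mean=%.4f" % (name, ["%.4f" % t for t in times], mean))
```

Absolute timings from a sketch like this will not match the table, since the interpreter, hardware, and iteration count all differ; only the relative ordering is worth comparing.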

I guess I should not have been surprised, but it turns out that simply calling repr() on the dictionary is almost 9 times faster than calling simplejson.dumps(). In fact, taking repr() as a baseline (100%), I calculated how long each of the other serializers took relative to repr():

Serializer           Mean (secs)  Relative to Baseline
pyYAML 3.05             23717.89                31469%
pySyck 0.61.2            2956.22                 3922%
pprint                   2366.67                 3140%
pickle                   1587.23                 2106%
pickle/protocol=2        1345.05                 1785%
simplejson 1.7.1          657.46                  872%
cPickle                   165.77                  220%
repr                       75.37                  100%
cjson 1.0.3                69.11                 91.7%
cPickle/protocol=2         54.34                 72.1%
marshal                    12.92                 17.1%

The numbers in the last column show how long each serializer took to serialize the test dictionary, expressed as a percentage of the time repr() took. Values above 100% are slower than repr(); values below 100% are faster.
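The relative column is just each mean divided by the repr() mean, times 100. A small sketch of that arithmetic, using the mean timings from the table (the MEANS dictionary and the relative_to() helper are names I made up for illustration):

```python
# Mean timings (secs) copied from the benchmark table.
MEANS = {
    "pyYAML 3.05": 23717.89,
    "pySyck 0.61.2": 2956.22,
    "pprint": 2366.67,
    "pickle": 1587.23,
    "pickle/protocol=2": 1345.05,
    "simplejson 1.7.1": 657.46,
    "cPickle": 165.77,
    "repr": 75.37,
    "cjson 1.0.3": 69.11,
    "cPickle/protocol=2": 54.34,
    "marshal": 12.92,
}


def relative_to(means, baseline="repr"):
    """Express every mean as a percentage of the baseline serializer's mean."""
    base = means[baseline]
    return {name: 100.0 * secs / base for name, secs in means.items()}


if __name__ == "__main__":
    for name, pct in relative_to(MEANS).items():
        print("%-20s %9.1f%%" % (name, pct))
```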

So now I'm thinking of sticking with JSON as my log format, but using the cjson module rather than simplejson. cPickle's latest binary format (protocol=2) is even faster, but I would lose the ability to visually scan the log file to get a feel for the quality of the data I'm not currently using.

Now, before I get a horde of comments I should point out that I am aware that simplejson has an optional C speedups module. Unfortunately, it does not appear to be installed by default on either FreeBSD (my server) or on Windows (my current client). I wouldn't be the least bit surprised if the C version of simplejson is just as fast as the cjson module, but it doesn't matter if it isn't installed. As such, it looks like I'll be switching to cjson for my JSON serialization needs from now on.
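One way to cope with a module that may or may not be installed is to pick the encoder at import time. The following is only a sketch of that pattern, not what my server does: it prefers cjson, falls back to simplejson, and finally to the standard-library json module (which did not exist in Python 2.5, so that last fallback assumes a newer Python). The name encode is my own choice.

```python
import json  # standard-library fallback on newer Pythons

try:
    import cjson  # third-party C implementation
    encode = cjson.encode
except ImportError:
    try:
        import simplejson  # third-party, optionally with C speedups
        encode = simplejson.dumps
    except ImportError:
        encode = json.dumps

if __name__ == "__main__":
    # Whichever encoder was found, the call site stays the same.
    print(encode({"client": "example", "value": 42}))
```

The catch with this pattern is that the encoders are not guaranteed to produce byte-for-byte identical output, so it fits best when the log is only ever read back by a tolerant JSON parser.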

Update 2007/07/25 07:07pm:
In response to paddy3118's comment, I added benchmarks for the python pprint module to the tables above.

Update 2007/07/27 12:26pm:
In response to David Niergarth's comment, I added benchmarks for pyYAML 3.05 and pySyck 0.61.2.

12 comments:

Gary Bernhardt said...

For those who consider switching serializers for performance reasons: beware! cjson's decoder is incomplete, and will incorrectly decode some data encoded by simplejson. Inspired by this post, I just posted some details about this incompatibility.

Kelly Yancey said...

Thanks for the info! That seems like a trivial issue in cjson to fix. Luckily, I currently don't have any application that needs to parse JSON coming from arbitrary sources, so I think I can still get by with cjson in the meantime.

Paddy3118 said...

Have you tried pprint?
It has the advantage over repr of sorting the keys.

If pprint is too slow then you could write a routine to print python dictionaries in sorted order of keys...

- Paddy.

Peter said...

Take a look at HDF5 and PyTables for fast serialization. It is wrapped around optimized C and is somewhat faster than cPickle.

http://www.python.org/pycon/papers/largedata.pdf

Kelly Yancey said...

Paddy: Thanks for the comment.

I hadn't considered pprint because I was pretty sure it was implemented in python so it should be as slow or slower than simplejson. But since I benchmarked python's pickle module (which is also implemented in python) I went back and updated the tables to include benchmarks for pprint too. Specifically, I used pprint's pformat() function since that has the same API as the other serializers I was benchmarking. Actually, I'm a little surprised just how slow it turned out to be.

David Niergarth said...

You might also consider YAML, which is also human readable.

http://pyyaml.org/
http://www.yaml.org/

"""YAML(tm) (rhymes with "camel") is a straightforward machine parsable data serialization format designed for human readability and interaction with scripting languages such as Perl and Python. YAML is optimized for data serialization, configuration settings, log files, Internet messaging and filtering."""

Kelly Yancey said...

David: I just updated the tables to include YAML. I'm not familiar with the YAML specification, but it must be a monster for the C implementation (pySyck) to be slower than both pickle and pprint's implementations, both of which are in python. The python implementation of YAML takes the new title of slowest serializer of the lot.

Kelly Yancey said...

Peter: Thanks for the pointer. The paper is a very interesting read and pyTables looks promising.

While pyTables and HDF5 are both BSD-licensed, for some reason the FreeBSD port adds a dependency on the (otherwise optional) lzo library which is GPL'ed.
I'm going to try building without that dependency (I make it a policy not to touch GPL'ed code with a ten-foot pole). Nonetheless, I won't add it to my serializer benchmarks since pyTables is a storage solution more in line with a database or Berkeley DB.

Thomas said...

You may also want to try PyYAML compiled with LibYAML. That's a LOT faster than PyYAML in pure-python. In my experience, it's still slightly slower than PySyck, so it's of little interest to you as a serialisation format.

David Niergarth said...

Thanks Kelly for adding YAML to your benchmarks. YAML is one of those technologies I've always wanted to find a use for but never have. Holy cow is it slow on your benchmark! I almost feel bad suggesting it, although it may have redeemed itself somewhat just by being so spectacularly slow! ;) You're right about the spec (it's 85 PDF pages) but it's hard to imagine what it could be doing to be so sluggish.

Sergey Miryanov said...

The Python pickle doc says:
The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again. marshal doesn’t do this.

Does cjson keep track of objects the way pickle does?

_generator said...

you might want to add pyamf too.