Since I'm already storing a subset of the submitted data in a database, the most obvious solution is a submissions table with a column for each submitted data element. However, that turns out to be quite slow, and since I'm not sure how much of the extra data I'll ever need -- or when I might update the server implementation to use it -- I hate to pay a hefty price to store it now. For now, I can treat the data as write-only. If and when I need it, I can write an import script that updates the server database from the saved data.
So I've been considering simply logging the submissions to a file. It is considerably faster to append to a flat file than to write to a database -- which makes sense, since the database supports read/write access, whereas I only need write access for now.
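The write-only logging approach is simple enough to sketch. This is a minimal illustration rather than my actual server code; the log path and the example fields are hypothetical, and the stdlib json module stands in for whatever serializer is ultimately chosen:

```python
import json

LOG_PATH = "submissions.log"  # hypothetical location

def log_submission(data):
    """Append one submission as a single line; no reads, no indexes."""
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(data) + "\n")

# One line per submission, so the log stays trivially greppable later.
log_submission({"user": "alice", "score": 42})
```

An import script can later replay the file line by line, which is all the read access this data should ever need.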
The next question is what format to use for the log file. I have a Python dictionary of the submitted data; at first I considered writing it to the log file in JSON format. JSON is relatively easy to convert to and from Python data structures, and Python has quality implementations for doing so. Furthermore, unlike the pickle text format, JSON is trivial to interpret visually. This latter point matters to me because I need to be able to judge the quality of the data in order to discern which portions I can use in the future.
However, to my chagrin, it turns out that the JSON module I have been using, simplejson, is slower than I had imagined. Profiling my server implementation found that, after the database update logic, serializing the submitted data into JSON format was my second-largest consumer of CPU cycles. I hate the thought of wasting so much time logging the data when it is an operation that is essentially pure overhead.
Hence I started considering other serialization formats, benchmarking them as I went. Here are the results of my benchmarks:
Serializer | Run 1 (secs) | Run 2 (secs) | Mean (secs) |
---|---|---|---|
pyYAML 3.05 | 21953.18 | 25482.61 | 23717.89 |
pySyck 0.61.2 | 3107.06 | 2805.38 | 2956.22 |
pprint | 2364.91 | 2368.42 | 2366.67 |
pickle | 1509.31 | 1665.16 | 1587.23 |
pickle/protocol=2 | 1359.40 | 1330.71 | 1345.05 |
simplejson 1.7.1 | 710.78 | 604.13 | 657.46 |
cPickle | 159.27 | 172.26 | 165.77 |
repr | 73.50 | 77.24 | 75.37 |
cjson 1.0.3 | 63.94 | 74.28 | 69.11 |
cPickle/protocol=2 | 50.97 | 57.72 | 54.34 |
marshal | 12.52 | 13.32 | 12.92 |
All numbers were obtained using the timeit module to serialize the dictionary created by the expression `dict([ (str(n), n) for n in range(100) ])`. The tests were run under Python 2.5 (r25:51908, Mar 3 2007, 15:40:46) built using [GCC 3.4.6 [FreeBSD] 20060305] on freebsd6. The simplejson, cjson, pyYAML, and pySyck modules were installed from their respective FreeBSD ports (I had to update the FreeBSD pySyck port to install 0.61.2, since it otherwise installs 0.55).
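The methodology can be reproduced with a short timeit harness. This sketch covers only the serializers available in the standard library (the third-party modules from the table would be benchmarked the same way), and the iteration count is an arbitrary choice for illustration; absolute timings will of course vary by machine and Python version:

```python
import json
import marshal
import pickle
import timeit

# The test payload from the post: a 100-entry dict of str -> int.
data = dict([(str(n), n) for n in range(100)])

serializers = {
    "repr": lambda: repr(data),
    "marshal": lambda: marshal.dumps(data),
    "pickle/protocol=2": lambda: pickle.dumps(data, 2),
    "json": lambda: json.dumps(data),
}

for name, fn in serializers.items():
    # Far fewer iterations than the original runs; enough to show the shape.
    secs = timeit.timeit(fn, number=10000)
    print("%-20s %8.3f" % (name, secs))
```

Each lambda does the full serialize call, so the loop measures exactly the overhead the post is worried about and nothing else.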
I guess I should not have been surprised, but it turns out that simply calling `repr()` on the dictionary is almost 9 times faster than calling `simplejson.dumps()`. In fact, taking `repr()` as a baseline (100%), I calculated how long each of the other serializers took relative to `repr()`:

Serializer | Mean (secs) | Relative to Baseline |
---|---|---|
---|---|---|
pyYAML 3.05 | 23717.89 | 31469% |
pySyck 0.61.2 | 2956.22 | 3922% |
pprint | 2366.67 | 3140% |
pickle | 1587.23 | 2106% |
pickle/protocol=2 | 1345.05 | 1785% |
simplejson 1.7.1 | 657.46 | 872% |
cPickle | 165.77 | 220% |
repr | 75.37 | 100% |
cjson 1.0.3 | 69.11 | 91.7% |
cPickle/protocol=2 | 54.34 | 72.1% |
marshal | 12.92 | 17.1% |
The numbers in the last column show how long each serializer took to serialize the test dictionary relative to `repr()`.

So now I'm thinking of sticking with JSON as my log format, but using the cjson module rather than simplejson. cPickle's latest binary format (protocol=2) is even faster, but I would lose the ability to visually scan the log file to get a feel for the quality of the data I'm not currently using.
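The readability trade-off is easy to demonstrate. A rough comparison using the stdlib json and pickle modules as stand-ins (the cjson module produces the same JSON text, just faster):

```python
import json
import pickle

record = {"user": "alice", "score": 42}

# JSON output is plain text, so log lines can be scanned by eye.
json_line = json.dumps(record)
print(json_line)

# pickle protocol 2 is a binary format: fast, but opaque in a log file.
pickled = pickle.dumps(record, 2)
print(repr(pickled[:20]))
```

Both round-trip the dictionary losslessly; only the JSON line is something a human can sanity-check with grep or a pager.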
Now, before I get a horde of comments I should point out that I am aware that simplejson has an optional C speedups module. Unfortunately, it does not appear to be installed by default on either FreeBSD (my server) or on Windows (my current client). I wouldn't be the least bit surprised if the C version of simplejson is just as fast as the cjson module, but it doesn't matter if it isn't installed. As such, it looks like I'll be switching to cjson for my JSON serialization needs from now on.
- Update 2007/07/25 07:07pm:
- In response to paddy3118's comment, I added benchmarks for the python pprint module to the tables above.
- Update 2007/07/27 12:26pm:
- In response to David Niergarth's comment, I added benchmarks for pyYAML 3.05 and pySyck 0.61.2.