I recently spent some time tracking down a Syndic8 performance problem, and I found some pretty horrible code in the PHP to bzip2 interface.
In order to conserve disk space, I had been using PHP’s bzcompress and bzdecompress functions to compress and decompress the XML for each feed. The Syndic8 polling process had been getting slower over time, but I was pretty sure that it was related to the amazing growth in the number of feeds in the database (nearing 250K as I write this).
Today I spent some time isolating and profiling the compression and decompression functions, and I was stunned by the results. I found a gigantic XML feed (about 5 MB of XML, uncompressed) and stored it in a file. Then I wrote a little bit of code to read in the XML and to compress and decompress bigger and bigger chunks of it. The results were simply amazing. Here is an excerpt from the output of my test program, as measured on my dual-processor 2.8 GHz Xeon machine:
Zipped string 177181 bytes long in 99 milliseconds, unzipped it in 2.7 seconds
Zipped string 531543 bytes long in 1.2 seconds, unzipped it in 28.8 seconds
Zipped string 1063086 bytes long in 2.7 seconds, unzipped it in 1.3 minutes
Zipped string 2126172 bytes long in 5.1 seconds, unzipped it in 4.7 minutes
Zipped string 2303353 bytes long in 3.3 seconds, unzipped it in 4.5 minutes
Zipped string 3012077 bytes long in 2.5 seconds, unzipped it in 11.2 minutes
Zipped string 4075163 bytes long in 13.8 seconds, unzipped it in 21.6 minutes
Zipped string 4961068 bytes long in 11.8 seconds, unzipped it in 51.7 minutes
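For the curious, the shape of that test program is easy to reconstruct. Here is a minimal sketch in Python's stdlib bz2 module rather than PHP, with a synthetic repetitive payload standing in for the real feed; the sizes and payload are illustrative, and a modern decompressor will not reproduce the pathological timings above:

```python
import bz2
import time

# Hypothetical stand-in for the big uncompressed XML feed.
xml = b"<item><title>feed entry</title></item>\n" * 30000  # ~1.1 MB

for size in (131072, 524288, len(xml)):
    chunk = xml[:size]
    t0 = time.perf_counter()
    zipped = bz2.compress(chunk)
    t1 = time.perf_counter()
    unzipped = bz2.decompress(zipped)
    t2 = time.perf_counter()
    assert unzipped == chunk  # round-trip sanity check
    print(f"Zipped {size} bytes in {t1 - t0:.3f}s, "
          f"unzipped in {t2 - t1:.3f}s")
```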
Every other compression algorithm that I am familiar with eats more time on compression than on decompression. Keeping a machine this big busy (with a CPU-bound process) for nearly an hour is a real accomplishment. Something was clearly wrong.
I just snuck a look at the PHP to Bzip2 interface code. This is file php-4.3.6/ext/bz2/bz2.c within my PHP source tree, and available online here.
The code in the bzdecompress function is just stunningly bad. I’m just not sure what the author was thinking when he wrote it, or what his test cases were. The code starts out with an initial buffer size of 4096 bytes. It then enters a loop. The body of the loop allocates a buffer of the current buffer size and then attempts to decompress the entire input into that buffer. If this fails because the buffer is too small, the size is incremented by 4096 and the loop goes around again. This means that my 5 MB string was partially decompressed about 1000 times, and the memory buffer was reallocated that many times as well!
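A quick back-of-the-envelope calculation shows why this pattern is quadratic. If the output is N bytes and the buffer grows by a fixed 4096 bytes per failed attempt, then attempt i decompresses roughly i × 4096 bytes before overflowing, so the total bytes processed is about N²/(2 × 4096) rather than N. For a 5 MB output (the exact N here is illustrative):

```python
import math

STEP = 4096          # fixed buffer growth per failed attempt, per bz2.c
N = 5_000_000        # roughly the size of the 5 MB uncompressed feed

attempts = math.ceil(N / STEP)
# Each failed attempt i fills an i*STEP buffer before running out of room;
# the final, successful attempt decompresses all N bytes.
total = sum(i * STEP for i in range(1, attempts)) + N
print(attempts)      # -> 1221 partial decompressions
print(total // N)    # -> 611, i.e. ~600x the work of a single pass
```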
Ouch, ouch, ouch!
I could make this a lot better in 20 minutes, but I decided as a matter of policy that I am a tool user, not a tool builder, when I am working on Syndic8. Why? Because building and fixing tools is fun, and I don’t want to get distracted.
If someone is reading this and wants to make and commit a real fix to the PHP 4 and 5 trees, I’ll be happy to tell you how to make this a lot more efficient.
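For what it’s worth, one standard fix (not necessarily the one the PHP maintainers would choose) is to decompress the input exactly once, as a stream, accumulating output in bounded chunks instead of retrying from scratch. Here is the idea sketched in Python, where bz2.BZ2Decompressor plays the role of the libbz2 streaming API that bz2.c could call; the chunk size and payload are arbitrary:

```python
import bz2

def bzdecompress_streaming(data, chunk_size=65536):
    """Decompress in a single pass, pulling output in bounded chunks."""
    decomp = bz2.BZ2Decompressor()
    chunks = [decomp.decompress(data, max_length=chunk_size)]
    while not decomp.eof:
        # Drain the remaining buffered output without re-reading the input.
        out = decomp.decompress(b"", max_length=chunk_size)
        if not out and decomp.needs_input:
            raise ValueError("truncated bzip2 stream")
        chunks.append(out)
    return b"".join(chunks)

payload = b"<xml>feed</xml>\n" * 100000       # ~1.6 MB of sample data
compressed = bz2.compress(payload)
assert bzdecompress_streaming(compressed) == payload
```

The input is scanned once and the output list grows linearly, so the cost is O(N) in the uncompressed size instead of quadratic.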