Syndication Scalability


Site operators are starting to get concerned about the scalability of the entire poll-based syndication model. In fact, Microsoft caused a stir within the community when they stopped including complete content in their MSDN feeds. Others (e.g. Robert Scoble) are starting to do the math, and the problem is becoming clear.

All of this talking is great — it certainly puts the issue out in the public. But we have to do something! I have some thoughts, and some working code.

First, some background…

Over 4 years ago I did some consulting work for what was then a tiny Seattle company. After meeting the two founders and listening to them talk for two hours, I walked away thinking “I am not sure what they are doing, but it is going to be cool.” They were talking about level 7 routers, event distribution, internet-wide scaling, real-time notification, and lots more. I was invited to meet with them after they saw an early version of my Headline Viewer news aggregator. At that time (and several times since) we talked about flowing headlines onto the desktop in real time.

I’ve been pursuing that goal ever since.

This embryo of a company was ultimately funded by Kleiner Perkins. The two founders were Adam Rifkin and Rohit Khare, and the company was KnowNow. I should add, as a disclaimer, that I do own a little bit of KnowNow stock as a result of my stint there as Temporary VP of Engineering.

KnowNow went on to build a great product, and they also spun off an open source version of it as mod-pubsub. This Apache module encapsulates all of the core publish, subscribe, and routing functionality of their commercial-grade product.

There is also a public instance of the product running at the same site.

Since early June I have been working to make Syndic8 into a great ping receiver. It now receives and processes a ping every couple of seconds and displays the results in the Pinged Feeds Box. I did all that I could to make the ping processing efficient and lightweight, even going so far as to use a RAM-based MySQL table for some transient data elements.

This week I took the next steps.

First, I made sure that each ping was a legitimate ping. There are two sources of what I will call “bogus” pings. One is that some sites, in desperate need of attention, will ping even though the associated feed has not changed. According to some stats that I started tracking yesterday, about 2/3 of the pings are bogus. The other is that some people will fine-tune a blog entry after it has been published, which seems to generate some spurious pings.

Second, I figured out what was truly new as of each ping. Because I store all of the XML for every feed in the Syndic8 feed list, it was a very simple matter to parse the old one, parse the new one, and compare them. This results in 0 or more new items (title, link, description, and so forth).
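To make the comparison concrete, here is a minimal sketch of the parse-and-compare step. It is not the actual Syndic8 code. It assumes a plain RSS 2.0 feed without namespaces, and it assumes that an item's link is what identifies it; the function names parse_items and new_items are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_items(feed_xml):
    """Extract title, link, and description from every <item> in a feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def new_items(old_xml, new_xml):
    """Return items in new_xml whose link does not appear in old_xml.

    Assumption: the link identifies an item, so an entry that was only
    edited (same link, different text) does not count as new.
    """
    seen = {item["link"] for item in parse_items(old_xml)}
    return [item for item in parse_items(new_xml) if item["link"] not in seen]
```

Under these assumptions an empty result corresponds to a bogus ping: either the feed did not change at all, or an existing entry was merely touched up.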

Third, I published the new items to the topic /what/syndic8.com/news/items at mod-pubsub.org. You can see the new items in the Event Introspector application at that site. This is a real-time browser-based application.

The ping processing within Syndic8 takes around 3 seconds, on average. This is mainly due to the need to actually fetch the feed; the internal processing is cheap, efficient, and scalable. I use a message queue (known as a “System V message queue” when I was a kid) as the asynchronous coupling between the first-stage processing, when the ping is received, and the second stage, when the XML is fetched. I can easily add more processes if the queue length starts to grow. It would not be hard to move this processing off to another machine (or machines) if necessary.
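The two-stage arrangement can be sketched roughly as follows. This is an illustration only: it uses Python's in-process queue.Queue and threads as a stand-in for a System V message queue and separate processes, and the names run_pipeline and fetch are hypothetical. The fetch function is passed in by the caller so the sketch stays self-contained.

```python
import queue
import threading


def run_pipeline(pinged_urls, fetch, workers=2):
    """Stage 1 enqueues pinged URLs; stage 2 workers fetch them.

    fetch is a caller-supplied function (hypothetical here) that takes a
    URL and returns the feed body. Returns a dict mapping URL to body,
    with None recorded for any fetch that raised.
    """
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = pending.get()
            if url is None:              # sentinel: no more work
                pending.task_done()
                return
            try:
                body = fetch(url)        # the slow step
            except Exception:
                body = None              # record the failure and keep going
            with lock:
                results[url] = body
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for url in pinged_urls:              # stage 1: cheap, just enqueue
        pending.put(url)
    for _ in threads:                    # one sentinel per worker
        pending.put(None)
    pending.join()
    for thread in threads:
        thread.join()
    return results
```

Raising the workers argument is the analogue of adding more processes when the queue starts to grow; the receiving side never waits on a fetch.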

The publishing end of this has been running for about 24 hours. So far, so good. Latency is low, system performance is still good. I’m working with some members of the mod-pubsub team to get some demos cooked up. We need to get some aggregator developers to take a look at what we’ve done so far and to figure out what else has to be done. We definitely need to work on categorization and metadata to allow aggregators to listen for changes within topical areas.

I think this could be the start of something big. From the blogger’s point of view I think this is pretty cool. Less than 5 seconds after a post is written and published, it can be present on machines all around the world.

We are now a few steps closer to that goal we settled upon over 4 years ago in that now-condemned building in Seattle, making plans in a room paneled entirely with whiteboards. It’s been a great journey so far, but the best is undoubtedly yet to come.
