Vista Torrents, Bandwidth, Scalability, and Amazon S3
Disclaimer: This is a post which reflects my personal opinions, and should not be construed as something that in any way reflects the opinions, thoughts, policies, or beliefs of my employer.
Last week Microsoft made a beta version of Windows Vista available as an open and free download. Predictably, demand for this single-file, 3.1 gigabyte download was very high, and taxed Microsoft’s ability to keep up. Microsoft also offered to ship out copies of Vista on a DVD.
In a conversation reported by Chris Pirillo, Microsoft acknowledged that the download was very popular. I was fine with this; Vista has been a long time in the making and a lot of people are anxious to give it a try. However, Chris also reported that Microsoft feared that opening up the pipes to allow additional concurrent downloads could actually put the infrastructure of the internet at risk. Without knowing the details behind this, I am a bit skeptical, but that’s not the point of this post.
In the same chat session where Microsoft acknowledged the reality of the slow downloads, an attendee asked why Microsoft didn’t create a BitTorrent seed for the file. Again, and somewhat predictably, Microsoft expressed their fear that this could result in people receiving corrupted or even fraudulent downloads.
Never one to sit idly by, Chris and his business partner Jake Ludington took the bull by the horns (no Longhorn joke intended) and created VistaTorrent.com, seeding it with an official copy of the 3.1 gigabyte Vista download. Chris announced this on Sunday evening.
If you’ve never used BitTorrent before, I should explain a few terms at this point. BitTorrent is a protocol which makes peer-to-peer file transfer simple and efficient. In the old days (before BitTorrent) file transfer took place on a point-to-point, server-to-client basis. If 100 clients downloaded the same file, the server would see 100 requests, and the server would consume bandwidth equal to 100 times the size of the file.
With BitTorrent, things are a lot different, and a bit more equitable. Instead of a central server, there’s a central location known as a tracker. As its name implies, the tracker keeps track of the set of BitTorrent clients where the bits and pieces of a file can be found. The clients are all simultaneously downloading parts of the file while keeping the other running clients aware of which pieces they have and which pieces they need. By adding in some algorithmic randomization, all of the clients eventually wind up with a complete copy of the file. At various times in the downloading process, each “client” will be both a client (to download data) and a server (to provide data to other clients).
One important fact, which I have skipped in my quick explanation above, is that the file has to come from somewhere. Any client which has a complete copy of the file is known as a seed. Putting that first complete copy of the file on the network is known as seeding the torrent. There’s also something called a swarm, but I don’t understand what it is or what it does.
The entire BitTorrent system has a number of remarkable properties. First, the system as a whole automatically tolerates the spontaneous appearance and disappearance of client applications from the overall network. As long as there is (in aggregate) one copy of each chunk of the file residing somewhere in the network, all of the downloads will eventually complete. BitTorrent measures the number of complete copies using a factor called availability. As long as this value is at least 1.0, all of the downloads should succeed (assuming that no vital client goes offline afterward). Second, the protocol is self-regulating and self-adjusting in the face of slow networks, fast networks, slow clients, and so forth. Third, the blocky nature of the protocol makes it possible for clients to lose the connection, and then make it again without having to restart the download.
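To make the availability idea a little more concrete, here is a rough sketch in Python. It is a simplification of my own, not code from any real client: the function name and the "list of sets" representation are made up for illustration, and it only counts whole copies (real clients generally report a fractional value as well).

```python
def availability(peer_pieces, num_pieces):
    """Count how many complete copies of the file exist across all peers.

    peer_pieces is a list of sets, one per connected peer, holding the
    indexes of the pieces that peer currently has. This representation is
    an assumption made for illustration only.
    """
    counts = [0] * num_pieces
    for pieces in peer_pieces:
        for index in pieces:
            counts[index] += 1
    # The rarest piece limits how many complete copies can be assembled.
    return min(counts)


# Three peers, none of which has the whole 4-piece file,
# yet every piece exists somewhere in the network.
print(availability([{0, 1}, {2, 3}, {0, 2}], 4))  # → 1

# Nobody has piece 3, so no download can complete.
print(availability([{0, 1}, {2}], 4))  # → 0
```

The first case is the interesting one: no single peer is a seed, but the downloads can still finish because the pieces exist in aggregate.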
There’s a huge amount of subtlety in the protocol, and Bram Cohen deserves some kind of international prize for making all of this work as well as it does.
When you “do the math” on the protocol, the results are very surprising. Let’s start with one seed, and 100 clients that want to download the same file. In the best (most totally random) case, each client will start up, pick a single random block, and request it from the seeding client. After that, the clients will become peers and each one will obtain the other blocks of the file from another client, and not from the seed. So, if all goes perfectly well, the seed will see just one request for each block of the file, and will use up bandwidth equal to the size of the file. Instead of the 100x case that we saw with the client-server download model, the bandwidth cost to the seed is effectively constant regardless of the number of downloads. There is traffic to and from the tracker, but this is very, very small when compared to the size of the file.
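Here is the same arithmetic as a tiny Python sketch. The function and parameter names are mine, and the BitTorrent branch assumes the idealized case described above, where the seed sends each block exactly once (tracker traffic is ignored).

```python
def seed_upload_gb(file_size_gb, num_clients, use_bittorrent, copies_from_seed=1):
    """Gigabytes the original server (or seed) has to send.

    copies_from_seed is how many times, on average, each block is fetched
    from the seed itself; 1 is the idealized best case.
    """
    if use_bittorrent:
        # Independent of the number of clients: peers trade the rest.
        return file_size_gb * copies_from_seed
    # Client-server: every client pulls the whole file from the server.
    return file_size_gb * num_clients


for clients in (10, 100, 1000):
    server = seed_upload_gb(3.1, clients, use_bittorrent=False)
    seed = seed_upload_gb(3.1, clients, use_bittorrent=True)
    print(f"{clients} clients: server sends {server:.1f} GB, seed sends {seed:.1f} GB")
```

The server column grows with every new client; the seed column does not move.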
Ok, so there are some flies in the ointment. As you can imagine, BitTorrent is great for trading large media files (audio and video) that might or might not be legal. This has caused some people to equate “BitTorrent” and “piracy” in their minds. In fact, nothing could be further from the truth and there are many, many ways to use BitTorrent in a totally legal fashion. For example, I use it to do backups of large data files from my Syndic8 server. New versions of the Linux kernel are made available in this way, and I am sure that there are many other great examples. If I was a lawyer I’d mention something about “substantial non-infringing uses” in my defense of this technology.
You should know that the copyright owners of commonly pirated files have taken to posting “poisoned” downloads to the various BitTorrent directories. I don’t know a whole lot about this, but I do know that it is a fairly dirty trick and that it is a severe “monkey wrench” in the system.
Moving right along, one of the more interesting issues in web-scale computing is, literally, scalability: the ability of centralized resources to grow in a cost-effective fashion to meet demand. As we can see from the Vista download crunch, client-server downloads don’t scale. When the richest company on the planet cannot afford sufficient resources to accommodate a download, something’s definitely wrong with the model. I hope that my little exposition of the BitTorrent protocol has shown that it can scale.
Now it’s time for the sales pitch (feel free to skip this part). Amazon’s Simple Storage Service (S3 for short) contains the foundational parts needed to build a scalable download solution. First, the actual bandwidth (20 cents per gigabyte) and storage (15 cents per gigabyte per month) charges are considered low by industry standards. Second, S3 supports a BitTorrent interface. Once a file has been uploaded to S3, it can be exposed as a fully seeded BitTorrent.
Let’s compare the cost for our hypothetical 100 clients to download the Vista binary in the traditional client-server fashion and with S3’s BitTorrent interface. To make it easy, let’s use S3’s bandwidth and storage costs for both cases. To make the math easier let’s assume that the binary is exactly 3 gigabytes in length.
The traditional download would cost 60 cents to upload, 45 cents to store and $60 to download, for a total of $61.05.
The BitTorrent download would again cost 60 cents to upload, 45 cents to store, but just 60 more cents to download, for a total of $1.65. As I noted above, this is the best possible case, where the clients use an optimal/perfect random distribution of initial block requests, so each block is downloaded from the seed just once. Even if the clients aren’t perfect and each block is downloaded, say, 5 times, the download portion rises to $3.00 and the total cost is still just $4.05, or about 7% of the traditional cost.
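If you’d like to play with the numbers yourself, here is a small Python sketch of the comparison. The function name and its parameters are my own invention; the prices are the ones quoted above (20 cents per gigabyte transferred, 15 cents per gigabyte per month stored), and I work in whole cents to keep the arithmetic exact.

```python
BANDWIDTH_CENTS_PER_GB = 20       # transfer in or out, as quoted above
STORAGE_CENTS_PER_GB_MONTH = 15   # storage, as quoted above


def s3_cost_cents(file_gb, gb_served, months=1):
    """Total cost in cents to upload, store, and serve a file.

    file_gb is the size of the file; gb_served is the total number of
    gigabytes that actually leave S3 (not the number of downloads).
    """
    upload = file_gb * BANDWIDTH_CENTS_PER_GB
    storage = file_gb * STORAGE_CENTS_PER_GB_MONTH * months
    download = gb_served * BANDWIDTH_CENTS_PER_GB
    return upload + storage + download


FILE_GB = 3
CLIENTS = 100

# Client-server: every client pulls the full file from S3.
traditional = s3_cost_cents(FILE_GB, gb_served=FILE_GB * CLIENTS)
# Best-case torrent: each block leaves the seed exactly once.
torrent = s3_cost_cents(FILE_GB, gb_served=FILE_GB * 1)

print(f"traditional: ${traditional / 100:.2f}")  # → traditional: $61.05
print(f"torrent:     ${torrent / 100:.2f}")      # → torrent:     $1.65
```

To model less-than-perfect clients, multiply gb_served in the torrent case by however many times you think each block will be pulled from the seed.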
Microsoft has some legitimate qualms about the integrity of the downloaded files. It would be possible for some evil-doer to label almost any random 3 gigabyte pile of data as Vista and to convince people to download it.
Chris and Jake sidestepped this possible impediment by computing and posting the MD5 hash of the file, and asking people to verify the value against what was posted. The MD5 hash guarantees with a high degree of certainty that the downloaded bits are as offered.
So, if Microsoft wanted to join the modern peer-to-peer era, they could allow BitTorrent downloads and simply post MD5 hashes of the data. To make it even easier, they could distribute (via direct download) a small program that would verify the integrity of the data obtained via the torrent. The integrity checker could perform multiple checks on the data to give all parties involved a lot of confidence that the bits came from Microsoft.
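A bare-bones version of such a checker is only a few lines of Python using the standard hashlib module. This is just a sketch of the MD5 comparison step, not anything Microsoft, Chris, or Jake actually shipped; the function names are mine, and the file name and hash in the usage note are placeholders.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks so a
    multi-gigabyte download never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_posted_hash(path, posted_hex):
    """Compare a downloaded file against the hash the publisher posted."""
    return md5_of_file(path) == posted_hex.strip().lower()
```

You would call it as matches_posted_hash("vista_beta.iso", posted_value), where both the file name and posted_value stand in for the real download and the hash published alongside it.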
I didn’t plan to write a book, but I hope that this has been at least somewhat educational. Leave me a comment, link to this, and let me know what you think.