An overview of BitTorrent

Back to basics: BitTorrent

Since the dawn of the internet, piracy, meaning the distribution of copyrighted works without permission, has been present on it. No matter how many crackdowns happen, people get arrested or servers are taken offline, it is still there and going strong. This is largely due to the decentralization of file transfers in that space. Normally, you can shut down a server and pat yourself on the back for stopping “the operation”, but not with piracy. This decentralization is powered by BitTorrent, and piracy is only one of the many uses of this interesting file transfer protocol.

To understand the protocol, we’ll have to start with some network architecture basics. After that, we will be able to fully go over the problem BitTorrent is trying to solve and do a quick overview of how it manages to solve it.

Network Architectures

When developing an application that holds and needs to distribute data between multiple computers, we need to think about how we organize who holds what and how this data gets from computer A to computer B. There are a few known ways to organize everything, and we’ll see the main two.

Client-Server model

The Client-Server model is the one we’re used to on the internet. Basically, it is a server that multiple clients make requests to so they can get data. If the clients need to pass data between them, they must pass it through the server since they can’t talk to each other.


Almost all the services we use are built on this model. Let’s take Google Drive, for example. For those who don’t know, Google Drive is a cloud storage service offered by Google. Let’s say I want to pass some photos from my phone to my computer. From my phone, I have to upload those photos to the Google Drive servers, then I have to log in to my Google account on my computer and download those files from their servers.

Peer-to-Peer model

The Peer-to-Peer model is the complete opposite. In this architecture, no server is needed for passing data between clients. The clients are “peers” and they talk directly to each other over a network.


Let’s take our file transfer example again. This time, instead of passing through Google servers to transfer the photos between my phone and my computer, I’ll just use software that transfers them from my phone to my computer directly.

The problems

Both of these models have problems that become clear at scale. The highly centralized nature of the client-server model works against it, and the decentralized nature of peer-to-peer works against it just as much.

We can take some real-life examples. YouTube is a client-server application, so all the platform’s videos are stored on its servers. This is fine at a small scale, but at YouTube scale, where videos never stop coming in, they have to choose between constantly growing their servers and infrastructure or deleting videos to free up storage. If they start deleting videos, the highly centralized nature of YouTube means that, on paper, they’re wiping those videos out of existence, which isn’t always desirable.

For peer-to-peer, the problem comes from the decentralization: the computers must find each other. This becomes a complex problem at scale since there are a lot of computers in the world, and that isn’t counting cybersecurity and having to work around some network protocol limitations (notably IPv4). This model also puts a lot of responsibility on every client. If you lose your computer, you lose the data on it, so imagine having to store all of your banking information locally instead of storing it on the bank or credit union servers. Peer-to-peer networks also make things generally hard to find since there isn’t a centralized place where all the data is.

This is where you might want a hybrid solution: a mix of the two models. Many such protocols exist and, for file transfers, one of them is BitTorrent. This protocol has a lightweight client-server component, which makes it easy to integrate with modern web servers and removes the complexity of making the clients find each other, while the file transfers themselves happen peer-to-peer, so the server doesn’t have to support an enormous load of file transfers and storage.

How BitTorrent works

BitTorrent achieves the hybrid model through three pieces, each playing a role in the transfer: the Torrent, the Peer and the Tracker.

Torrent

The torrent itself is the metainfo file, which has the .torrent file extension. It is full of metadata about the file or files you want to transfer, all encoded in bencode. The bencode dictionary it contains has two main keys: announce and info.
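To make the format a little more concrete, here is a minimal sketch of a bencode decoder. Bencode has four value types: integers (i42e), byte strings (4:spam), lists (l...e) and dictionaries (d...e). The function name bdecode, the sample metainfo bytes and the tracker URL are made up for illustration, and the sample is far smaller than a real .torrent file (it leaves out the piece hashes, for instance).

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at data[pos].

    Returns a (value, next_position) tuple.
    """
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"invalid bencode at offset {pos}")


# A tiny, hypothetical metainfo dictionary with the two keys discussed above.
sample = (
    b"d8:announce31:http://tracker.example/announce"
    b"4:infod6:lengthi1024e4:name8:demo.txt12:piece lengthi512eee"
)

metainfo, _ = bdecode(sample)
print(metainfo[b"announce"])
print(metainfo[b"info"])
```

Keys and strings come back as bytes rather than text because bencode strings are raw byte sequences; a real client would decode the human-readable ones itself.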

The announce key has the URL to the tracker (or trackers) that the peer needs to announce itself to. We’ll touch more on trackers later in this article, but for now, just know that a client needs to “announce” itself to it to be able to download the files listed in the metainfo file.

The info key contains all the useful information about the files being transferred with this torrent, like names, directory structure, etc. It helps the client put everything together while downloading and also lets you see what you’re downloading. Another important point, which we’ll come back to later, is that the protocol transfers files by slicing them into chunks, called pieces, and transferring them piece by piece. This key describes how the files are divided in this torrent. Each piece has its own SHA-1 hash, so the client can verify that it received the right chunk.
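The chunking and hashing idea can be sketched in a few lines. In the metainfo file, the per-piece hashes are stored as one long byte string of concatenated 20-byte SHA-1 digests. The helper names below are hypothetical, and the piece length is tiny on purpose so the example stays readable; real torrents use much larger pieces.

```python
import hashlib

SHA1_SIZE = 20  # a SHA-1 digest is always 20 bytes


def split_pieces(data: bytes, piece_length: int) -> list:
    """Slice the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def build_pieces_field(data: bytes, piece_length: int) -> bytes:
    """Concatenate the SHA-1 digest of every piece into one byte string."""
    return b"".join(
        hashlib.sha1(piece).digest() for piece in split_pieces(data, piece_length)
    )


def verify_piece(index: int, piece: bytes, pieces_field: bytes) -> bool:
    """Check a received piece against the digest stored for its index."""
    expected = pieces_field[index * SHA1_SIZE:(index + 1) * SHA1_SIZE]
    return hashlib.sha1(piece).digest() == expected


data = b"hello bittorrent"            # stand-in for a file's contents
pieces_field = build_pieces_field(data, piece_length=4)

good_piece = split_pieces(data, 4)[1]
print(verify_piece(1, good_piece, pieces_field))   # matches the stored digest
print(verify_piece(1, b"XXXX", pieces_field))      # corrupted or wrong piece
```

Because every piece is checked independently, a client can accept pieces from peers it has no reason to trust and simply discard any piece whose hash does not match.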

Peer

A group of clients that have the same torrent is a swarm. In a swarm, all those clients are called peers. To become one, a client must open a metainfo file with a torrent client; then it becomes a peer of that torrent. Peers announce themselves to the tracker. After that, they’re able to transfer the files listed in the metainfo file between each other.

There are two types of peers. The first is the seeder, which is a peer that makes the file available for download. When a new peer joins the swarm, it’ll ask a seeder for a chunk of the file it wants (or all of it if there’s only one seeder). Without seeders, a torrent becomes dead and undownloadable, so each swarm must have at least one seeder for its files to be downloadable.

The second is the leecher. The leecher is simply a peer that joined the swarm and is downloading the files. Normally, when all the files are downloaded, the torrent client will automatically switch to seeding the files (making that peer a seeder) unless asked not to. One small detail that helps the protocol scale is that leechers also seed the chunks they have already downloaded to other leechers. This helps reduce the load on seeders.
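To get a feel for why leechers sharing chunks takes load off the seeder, here is a toy simulation. It is not how a real client schedules downloads: the round-based model, the rule of preferring another leecher whenever one already holds the wanted piece, and all the names are assumptions made for illustration.

```python
import random


def simulate_swarm(num_pieces: int, num_leechers: int, seed: int = 0) -> dict:
    """Toy model: each round, every incomplete peer fetches one missing piece."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    have = {"seeder": set(range(num_pieces))}
    for i in range(num_leechers):
        have[f"leecher-{i}"] = set()

    served_by_seeder = 0
    served_by_leechers = 0
    rounds = 0
    while any(len(pieces) < num_pieces for pieces in have.values()):
        rounds += 1
        # Peers can only serve what they held when the round started.
        snapshot = {name: set(pieces) for name, pieces in have.items()}
        for name, pieces in have.items():
            missing = sorted(set(range(num_pieces)) - pieces)
            if not missing:
                continue  # this peer is complete and only uploads now
            piece = rng.choice(missing)
            holders = [
                other for other, theirs in snapshot.items()
                if other != name and piece in theirs
            ]
            # Assumed policy: use another leecher when possible, else the seeder.
            leecher_holders = [h for h in holders if h != "seeder"]
            source = rng.choice(leecher_holders or holders)
            pieces.add(piece)
            if source == "seeder":
                served_by_seeder += 1
            else:
                served_by_leechers += 1

    return {
        "rounds": rounds,
        "served_by_seeder": served_by_seeder,
        "served_by_leechers": served_by_leechers,
        "all_complete": all(len(p) == num_pieces for p in have.values()),
    }


print(simulate_swarm(num_pieces=8, num_leechers=4))
```

In this model, every piece a leecher serves is one the seeder did not have to upload itself, and the two counters in the result show how the uploads were split.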

Tracker

The tracker is the server-side part of the protocol that keeps track of peers. It was mentioned earlier that peers “announce” themselves to the tracker listed in the metainfo file. When announcing themselves, they send information such as the torrent’s hash, the network port they use for file transfers, and whether they are downloading or seeding.
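As a rough sketch, an announce over HTTP is a GET request whose query string carries that information. The parameter names below (info_hash, peer_id, port, uploaded, downloaded, left, event) follow the original BitTorrent specification; the function name, tracker URL, hash and peer ID are placeholders, and the sketch assumes the announce URL has no query string of its own.

```python
from urllib.parse import urlencode


def build_announce_url(announce: str, info_hash: bytes, peer_id: bytes,
                       port: int, uploaded: int, downloaded: int,
                       left: int, event: str = "") -> str:
    """Build the URL a peer would request to announce itself."""
    params = {
        "info_hash": info_hash,    # 20-byte SHA-1 identifying the torrent
        "peer_id": peer_id,        # 20-byte ID chosen by the client
        "port": port,              # port this peer listens on for transfers
        "uploaded": uploaded,      # bytes uploaded so far
        "downloaded": downloaded,  # bytes downloaded so far
        "left": left,              # bytes still missing; 0 means seeding
    }
    if event:                      # "started", "completed" or "stopped"
        params["event"] = event
    # urlencode percent-encodes raw bytes values, which the hash needs.
    return f"{announce}?{urlencode(params)}"


url = build_announce_url(
    announce="http://tracker.example/announce",  # hypothetical tracker
    info_hash=b"\x12" * 20,                      # placeholder hash
    peer_id=b"-XX0001-000000000000",             # placeholder peer ID
    port=6881,
    uploaded=0,
    downloaded=0,
    left=1024,
    event="started",
)
print(url)
```

The tracker can tell a seeder from a leecher by the left value: a peer that has nothing left to download is seeding.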

The tracker has to track the peers in a swarm. Every time a peer announces itself, the tracker updates its record with the new information it received and returns a bencoded dictionary listing other peers in that swarm. For each peer, that dictionary simply contains the IP and port and, sometimes, an ID sent by the peer.
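The bookkeeping described here fits in a small class. The sketch below keeps everything in memory and returns a plain Python dictionary; a real tracker would sit behind an HTTP endpoint and bencode the response. The class name, the field names in the stored record and the interval value are assumptions for illustration.

```python
import time


class ToyTracker:
    """In-memory sketch of a tracker's per-swarm bookkeeping."""

    def __init__(self, interval: int = 1800):
        # Seconds a peer should wait before reannouncing (arbitrary value here).
        self.interval = interval
        # info_hash -> {(ip, port): {"left": int, "seen": float}}
        self.swarms = {}

    def announce(self, info_hash: bytes, ip: str, port: int,
                 left: int, event: str = "") -> dict:
        swarm = self.swarms.setdefault(info_hash, {})
        key = (ip, port)
        if event == "stopped":
            swarm.pop(key, None)          # the peer is leaving the swarm
        else:
            swarm[key] = {"left": left, "seen": time.time()}
        # Answer with every other peer currently known for this torrent.
        others = [
            {"ip": peer_ip, "port": peer_port}
            for (peer_ip, peer_port) in swarm
            if (peer_ip, peer_port) != key
        ]
        return {"interval": self.interval, "peers": others}


tracker = ToyTracker()
torrent = b"\xaa" * 20                    # placeholder info hash

tracker.announce(torrent, "10.0.0.1", 6881, left=0)        # a seeder
reply = tracker.announce(torrent, "10.0.0.2", 6881, left=1024)
print(reply["peers"])                     # the newcomer learns about the seeder
```

Note that the tracker never touches file data: its whole state is a mapping from a hash to a set of addresses.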

It is important to note that the tracker only receives the info hash of the torrent, which is a 20-byte SHA-1 hash. It is able to keep track of swarms by keeping a record of which peers announced themselves with that hash, but other than that, it has no idea what is being transferred between the peers. This makes trackers incredibly simple to create and support.
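The info hash is the SHA-1 digest of the bencoded info dictionary from the metainfo file. Here is a minimal sketch with a small bencode encoder; the function names and the sample info dictionary are made up for illustration. One caveat: a real client normally hashes the exact bytes of the info value as they appear in the .torrent file instead of re-encoding a decoded dictionary, since any difference in encoding would change the hash.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, bytes, lists and dicts (with bytes keys) as bencode."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        out = b"d"
        for key in sorted(value):          # bencode requires sorted keys
            out += bencode(key) + bencode(value[key])
        return out + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def compute_info_hash(info: dict) -> bytes:
    """Return the 20-byte SHA-1 digest identifying a torrent."""
    return hashlib.sha1(bencode(info)).digest()


# Hypothetical, heavily trimmed info dictionary.
info = {b"name": b"demo.txt", b"length": 1024, b"piece length": 512}

digest = compute_info_hash(info)
print(len(digest))      # 20 bytes, as a SHA-1 digest always is
print(digest.hex())
```

Since the hash is one-way, the tracker cannot recover file names or contents from it, which matches the point that it has no idea what is being transferred.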

How they interact together

A classic interaction between those pieces goes like this:

The client opens the metainfo file (torrent).
The client announces itself as a new peer to the tracker listed.
The tracker adds the client to its record of this swarm and answers with a list of other peers, seeders and leechers, in that swarm.
The client then downloads chunks of the files from multiple peers. This is where the peer-to-peer part of the protocol kicks in.
When the download is done, the peer becomes a seeder and seeds the files to other leechers.

In this workflow, all the peers reannounce themselves to the tracker periodically, at an interval the tracker gives in its response, so it can keep the data of the swarm updated.

This is the normal workflow between all those pieces. The hybrid approach of using a server to pass around important data, like the list of peers in the swarm, while making the transfers peer-to-peer is what makes BitTorrent a powerful protocol. This approach has also proven that it scales well to swarms of thousands of peers without too much trouble.