Anders Pearson edited this page Jan 27, 2015 · 5 revisions

Cask is a content-addressable storage cluster. It aims for high availability, durability, and very little operations overhead. It exposes a basic external REST API and supports pluggable storage backends.

Currently, its sweet spot is using commodity hardware (large, inexpensive, but unreliable consumer hard drives) to store large amounts of infrequently changing data. Cask is not intended to be user-facing; rather, it is a component that handles storage for another application (or applications).

For example, say you are a photographer with 5TB of RAW files that you want archived, with more being steadily added. Most of those files will just sit there, but every now and then you'll need one of them. You could burn lots of DVDs, writing each file to multiple discs to decrease the chances of losing it to corruption, but that's a lot of manual work, and DVDs actually get expensive after a while at those rates. Consumer hard drives are faster and cheaper. But they are notorious for buying the farm every now and then, so you need to make sure that each file gets copied to a few different drives. You can't really buy 5TB drives at reasonable prices (not yet in 2015, at least), so instead you need a few 3 or 4TB drives, and you need to work out a system to spread your files across them (and then replicate between multiple drives). This is certainly doable, but it starts to get tedious. Then, since consumer drives can also silently corrupt data, you really ought to be periodically checking everything stored on those drives, calculating checksums of the files, etc.

Cask handles all of this for you. You can pick up a bunch of cheap drives, put them in cheap USB bays, start up a Cask node pointing at each, and it will do the rest: spreading data evenly across the cluster, ensuring N replicas of each file on N different nodes, automatically checking for corruption and repairing it, etc.
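
Content addressing is what makes the corruption check cheap: if a file's key is derived from its bytes, any copy can be verified by rehashing it. The following is a minimal sketch of that idea, not Cask's actual code. The function names, the use of SHA-256, and the in-memory dicts standing in for drives are all assumptions made for illustration.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Derive a key from the bytes themselves (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt(store: dict[str, bytes]) -> list[str]:
    """Return the keys whose stored bytes no longer hash to the key."""
    return sorted(k for k, blob in store.items() if content_key(blob) != k)


def repair(store: dict[str, bytes], replicas: list[dict[str, bytes]]) -> list[str]:
    """Replace each corrupt blob with the first replica copy that still verifies.

    `store` and each entry of `replicas` are hypothetical stand-ins for the
    storage held by individual nodes.
    """
    repaired = []
    for key in find_corrupt(store):
        for other in replicas:
            blob = other.get(key)
            if blob is not None and content_key(blob) == key:
                store[key] = blob
                repaired.append(key)
                break
    return repaired
```

A copy that fails verification is never used as a repair source, so a second corrupted replica cannot overwrite a good one.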

Depending on your needs, you can also seamlessly integrate cloud storage with this setup. Cask currently supports S3 and Dropbox, with more on the way. As of early 2015, 1TB of storage will cost you $10/month on Dropbox or $30/month on S3, with prices falling regularly. So you could have a cluster with a few local USB drives, an S3 bucket, and a Dropbox account. Cask will balance between them without any extra effort.

A Cask cluster consists of one or more Nodes (fewer than three is probably not worth bothering with, though, and five or more is strongly recommended). Each node has some storage that it's responsible for (either local disk-backed or one of the cloud storage backends), listens on a specified TCP port, and has a unique (probably randomly generated) ID.

The nodes discover each other via a gossip protocol. When a file is uploaded to any node in the cluster (via the REST API), that node distributes copies of the file to N different nodes (determined by The Ring) and gives the client a key that it can use to later retrieve the file from any node of the cluster.
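
One common way to pick those N nodes is consistent hashing: node IDs and file keys are hashed onto the same circular space, and a key's copies go to the first N nodes found walking clockwise from the key's position. The sketch below shows that general technique under stated assumptions. It is not taken from Cask's implementation of The Ring, the names `ring_position` and `nodes_for_key` are hypothetical, SHA-256 is an arbitrary choice, and it assumes unique node IDs and omits refinements such as virtual nodes.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a node ID or file key to a position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def nodes_for_key(key: str, node_ids: list[str], n: int) -> list[str]:
    """Return up to n distinct nodes, walking clockwise from the key's position."""
    ring = sorted((ring_position(node_id), node_id) for node_id in node_ids)
    positions = [pos for pos, _ in ring]
    # First node strictly after the key's position; wraps around via the modulo.
    start = bisect.bisect_right(positions, ring_position(key))
    count = min(n, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]
```

Because placement depends only on the key and the set of node IDs, every node that knows the cluster membership computes the same answer, which is what lets a client fetch a file from any node.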

The cluster is self-balancing and self-repairing. If a node dies (hard drive failure, network goes down, explosion, etc.), the rest of the cluster maintains availability and eventually replaces the copies that were stored on the dead node, so that a full N copies are again available in the network. When a new node joins the cluster, file copies move onto it until the work and storage load is balanced evenly across the cluster. See the section on Active Anti Entropy for more details.
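
To make the repair step concrete, here is a toy planner that, given which nodes hold each key and which nodes are still alive, works out where new copies are needed to get back to N. It is only an illustration of the bookkeeping. The function `plan_repairs`, its arguments, and the least-loaded target selection are assumptions, and a real cluster would choose targets from the ring rather than from a global view like this.

```python
def plan_repairs(
    holders: dict[str, set[str]], live_nodes: set[str], n: int
) -> dict[str, list[str]]:
    """For each under-replicated key, list the live nodes that should get a new copy."""
    # Count how many copies each live node currently holds.
    load = {node: 0 for node in live_nodes}
    for nodes in holders.values():
        for node in nodes & live_nodes:
            load[node] += 1

    plan = {}
    for key in sorted(holders):
        surviving = holders[key] & live_nodes
        missing = n - len(surviving)
        if missing <= 0 or not surviving:
            continue  # fully replicated, or no surviving copy to repair from
        # Prefer the least-loaded live nodes that do not already hold the key.
        candidates = sorted(live_nodes - surviving, key=lambda node: (load[node], node))
        targets = candidates[:missing]
        for node in targets:
            load[node] += 1
        if targets:
            plan[key] = targets
    return plan
```

For instance, with N = 3, if node `c` dies while holding copies of two keys, the planner assigns one replacement copy of each key to a surviving node that does not already have it.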

Cask is heavily inspired by Tahoe-LAFS and Riak.
