Sunday, July 26, 2009

Hay-O! NSFW comic strip re: Key-Value stores

How do I query the database? - Click thru for the comic - uses the F-word, so if that offends you, now's the time to f-off.

How do I query the database?

This is from the highscalability site, who have been doing a fine job chronicling the revolution, such as it is.

Sunday, July 5, 2009

When you know what you're after and you have to scale: distributed key-value stores

Last time we talked about key/value databases like Berkeley DB and more recent variants like the Tokyo Tyrant. On the one (very small) end, these can make for really nice, in process datastores. On the other end, for very very large databases with high performance and scaling needs, these key-value stores have found a niche as well.

Do you really need mad scaling capability?

It's probably not the case that your political blog needs to scale to serve millions of users, even if you have some readers who are not your relatives or friends, which would already place your blog in the top 10% of blogs. Additionally, although 'we're not building' is sometimes used as an excuse for neglecting performance considerations, it is really safe to say you are not building Your needs are likely different. Ted Dziubia has covered this nicely in his blog post: 'I'm going to scale my foot up your ass'.

So why bother talking about it at all, then?

Academic interest. The pursuit of general well-roundedness. Boredom with writing for the more loosey goosey blog mentioned in my intro post. Perhaps I'm turning into one of the clowns I knew 10 years ago who always was installing the latest version of Red Hat, but never was in fact using Red Hat. I hope not. Also, who knows, you aren't building, but you may end up at a place using these things some day, or more likely, you may be using such technology via the much beloved 'cloud'.

Amazon's Dynamo

The people who actually were building came up with a data storage system called Dynamo. If you'd like to read a paper describing it, from the source, you can download the pdf.

According to the paper, many of Amazon's services work just fine with a key-value store and don't necessarily need a relational database schema or flexible, declarative rather than imperative query language like SQL. This makes sense as Amazon's system is doing things like looking up products by key, storing shopping carts according to keys for customers, etc and so on. The really big goal, again, is scaling and efficiency, and according to the paper, systems with ACID properties tend to suffer where availability is concerned (thanks to locking needs and so forth - the thought of handling locks across hundreds or thousands of commodity servers in multiple data centers is headache inducing). It's said that you have to pick at most two from reliability, availability and consistency. Amazon trades consistency for availability and reliability (resiliance in the face of node failure) where needed.

Eventual Consistency

Consistency is not completely thrown out the window. Dynamo has a property called 'Eventual Consistency'. All updates 'eventually' reach all nodes. High write availability is a goal of the system, which is easy enough to understand - forcing a user to wait until all nodes know that 'Infinite Jest' has been added to the shopping cart would be a good way to lose customers. There will be inconsistencies, but resolution of the inconsistencies is pushed off for reads to handle.

The next question to address is whether the data store or the application is supposed to handle resolution. The author of an application (ideally) has knowledge of the data and how it is used, and would be perhaps best qualified to determine a resolution strategy. On the other hand, this sort of micro-management is a recipe for inconsistency across applications and may not be the best use of developer time.

Dynamo stamps updates with something called a vector clock (while it'd be nice to stamp updates with a time stamp from a universal clock, unfortunately a universal clock doesn't exist in practice). The vector clocks can be used to determine which update is the latest. This is discussed in more detail in a post on Paper Trail. To cram the idea into a sentence, the dimensions of the vector are the servers, and the values of the dimensions are from an internal clock (or event counter) maintained by the server in question.

Usually this is sufficient to determine which version is the latest, and this is referred to as syntactic reconciliation. When this can't be done, and the application has to handle it (for example, the contents of 2 versions of a cart might be merged), it's called semantic reconciliation. Feel free to use those terms to show off, but whatever you do, don't spell it 'symantec reconciliation', or people will laugh at you behind your back, much like we laughed at the pompous consultant who did that in one of his masterful 'roadmaps'.


One of the goals was to develop a system where nodes can be added as needed and failures can be handled smoothly. Also, the possibility of having nodes of various power needed to be factored in. To that end, an Amazon-ized version of consistent hashing (a really good write-up on consistent hashing by somebody who's used it in the field may be found here) is used to partition the system, and instead of considering all nodes equal, virtual nodes are used - more powerful servers can be responsible for more virtual nodes.

This being a key/value data store, and there being a hash function that's used to map the key to a 128-bit value, when we say 'partition the system', we're partitioning the key space. A number of virtual nodes (the 'preference list') are responsible for a range of keys, and that data is replicated across the first N (a parameter that's configured per-instance) nodes. N < (number of nodes a key maps to), to account for failures. Puts and gets are handled by the first of the top N nodes in the preference list. To maintain consistency, for a read, at least R nodes have to participate. For a read, W nodes. If R + W > N, congratulations, you have a 'sloppy' quorum system. R, W, and N are user configurable, with a typical choices being N = 3, R = 2, W = 2.

In Amazon's case, the data is replicated across datacenters, for big-time fault tolerance. So again, a Mom and Pop shop is probably not going to have multiple datacenters, but even a Mom and Pop shop will be able to enjoy this kind of availability as cloud computing becomes more available and mainstream. Unfortunately this deprives the IT guy of the opportunity to swing his dick around bragging about his datacenters. You win some, you lose some.

Where does Berkeley DB fit in, again?

According to the paper, different storage engines can be plugged in to this framework. The engines include Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store. Berkeley DB Transactional Data Store is the most common. Dynamo-esque systems like LinkedIn's Voldemort also provide this sort of plug-in framework.

What about security?

It's assumed all of this is happening within an already secure, trusted environment. Only systems inside of Amazon are accessing this directly.

So that's it?

Obviously that's not all there is to Dynamo, and even the paper is an overview. It is a nice jumping off point when considering related #nosql options, especially considering that there are a number of Dynamo-like systems out there.

A shortcoming of these systems is that they're not particulary good for reporting applications or ad-hoc querying. This is fine and dandy, since that's not primarily what they're designed to do, but in the future we'll talk about Map Reduce and other tools used to that end.

That is all for now.