The Value of Log Structured Merge Trees
November 20, 2015
Indexes are usually built by way of a data structure; typically, that structure takes the form of a “tree.” Most commonly, the structure of choice is a B-Tree, which is a hierarchical organization defined by the arrangement and interactions of its nodes.
High in a B-Tree’s hierarchy, you find a root node. Each of the items in that root node points to a collection of items in a child node, and this can go on for multiple levels. Eventually, you reach the leaf nodes, which hold the indexed values themselves; those leaf nodes point back to the rows in the main table. For a concrete example, in MyISAM the leaf nodes store offsets into the main data file – a MyISAM table’s main data file is just a sequential collection of rows.
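That descent from root to leaf can be sketched in a few lines of Python. This is a simplified illustration, not a real B-Tree implementation: the `Node` class, its fields, and the two-level example are all invented for the sketch, and real B-Trees pack many keys per node, rebalance, and split.

```python
import bisect

class Node:
    """A simplified B-Tree node: internal nodes hold separator keys and
    children; leaf nodes hold keys and offsets into the data file."""
    def __init__(self, keys, children=None, offsets=None):
        self.keys = keys            # sorted keys in this node
        self.children = children    # child nodes (internal node only)
        self.offsets = offsets      # row offsets in the data file (leaf only)

def search(node, key):
    """Descend from the root to the leaf that could hold `key`."""
    while node.children is not None:             # still an internal node
        i = bisect.bisect_right(node.keys, key)  # pick the child branch
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)       # at a leaf: look for the key
    if i < len(node.keys) and node.keys[i] == key:
        return node.offsets[i]                   # offset into the data file
    return None
```

Each internal node narrows the search to one child, so a lookup touches only a handful of nodes even in a very large tree – that logarithmic descent is what makes the index cheap to traverse.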
The purpose of an index is to allow a user to very quickly traverse just such a structure – whatever form that structure or algorithm might take – and to locate and examine rows or ranges of rows. The bottom-line goal is to reduce the cost your system incurs when users access those rows and their data.
A B-Tree is multi-purpose and easily the most popular index data structure: it’s the default not only in MySQL but in SQL Server, Oracle, and pretty much every other relational database that might come to mind. The primary characteristics of a B-Tree are that its leaf nodes hold rows in sorted order, and that it supports looking up a single row – determining whether it exists and where to find it in the table – as well as scanning ranges of rows in order, whether from beginning to end or over some subset. In most implementations, the leaf nodes are doubly linked, so an index sorted in one direction can be traversed in both, essentially allowing you to scan in reverse if necessary.
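The three access patterns described above can be illustrated with a plain sorted list standing in for the B-Tree’s sorted leaf level – the keys here are arbitrary example values, and `bisect` plays the role of the tree descent:

```python
import bisect

keys = [3, 7, 11, 19, 23, 31]   # leaf-level keys, kept in sorted order

# 1. Point lookup: does key 19 exist, and at what position?
i = bisect.bisect_left(keys, 19)
found = i < len(keys) and keys[i] == 19   # True, found at position i

# 2. Range scan: all keys in [7, 23), walking the leaves in order.
lo = bisect.bisect_left(keys, 7)
hi = bisect.bisect_left(keys, 23)
forward = keys[lo:hi]            # [7, 11, 19]

# 3. Reverse scan: doubly linked leaves let you walk backwards
#    over the same range.
backward = keys[lo:hi][::-1]     # [19, 11, 7]
```

Sorted order is what makes the range and reverse scans cheap: once the starting leaf is located, the scan just follows the leaf links without ever touching the upper levels of the tree again.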
B-Trees have been around, both conceptually and in practice, for decades. A B-Tree is not the only option, however: an important alternative is the Log Structured Merge (LSM) Tree, a structure that has only been around since the mid-1990s. Unlike a B-Tree, which has a canonical reference implementation – the kind you might see in a university computer science course – the LSM Tree has a variety of implementations, and the differences between one LSM Tree and another can be quite notable.
An LSM Tree is really an entire family of related data structures and algorithms. The general principle is that data, once written, is immutable: if a change is necessary, a new record is layered on top rather than replacing the old one. When a data segment fills, it becomes inactive and read-only; eventually, older segments are merged together into lower, ever-larger levels. In terms of hierarchy, an LSM Tree shares many characteristics with a tree, but there are many subtle nuances, often making LSM Trees considerably more complex than a B-Tree.
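The write-and-freeze behavior can be sketched as a toy store in Python. Everything here – the `TinyLSM` class, the `segment_limit` threshold, the dict-based segments – is invented for illustration; real LSM engines add write-ahead logs, Bloom filters, sorted on-disk files, and tiered compaction strategies.

```python
class TinyLSM:
    """Toy LSM-style store: writes go to an in-memory segment; when it
    fills, it becomes immutable, and old segments can be merged."""
    def __init__(self, segment_limit=2):
        self.segment_limit = segment_limit
        self.active = {}    # the one mutable, in-memory segment
        self.frozen = []    # immutable older segments, newest first

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.segment_limit:
            self.frozen.insert(0, self.active)  # segment becomes read-only
            self.active = {}

    def get(self, key):
        # Search newest data first; older layers are never modified,
        # so the first hit is always the current value.
        for segment in [self.active] + self.frozen:
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        # Merge the frozen segments into one; newer entries win.
        merged = {}
        for segment in reversed(self.frozen):   # oldest first
            merged.update(segment)              # newer overwrites older
        self.frozen = [merged] if merged else []
```

Note that an update never touches the old record: `put` simply writes a fresh entry in the active segment, and `get` finds it first because lookups proceed newest-to-oldest. Compaction is what eventually reclaims the space held by superseded entries.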
So why care about LSM Trees? Consider how B-Trees update data in place: they must read a page before writing it, which leads to issues such as page splits and page merges that require blocking or concurrency control. The result is a great deal of overhead, which is why B-Trees are much faster in memory than on disk. Log Structured Merge Trees, by contrast, avoid the catastrophic drop-off in performance when the working set of data grows larger than memory and has to spill onto disk. If you compare Percona’s benchmarks of InnoDB and TokuDB, you’ll see InnoDB’s performance fall off a cliff, while TokuDB’s stays flat and steady for a long time. This is precisely the sort of nuance and idiosyncrasy that can recommend an LSM Tree over a B-Tree – as powerful as the classic, canonical option might be, something newer and more unusual can sometimes be just what a system needs.