B-Trees, large volumes of data and branching factor

Jasim A Basheer

B-trees can be slow

I've noticed that after a few million keys, almost all b-tree based databases start to slow down. This applies to relational/ACID-compliant databases as well as NoSQL stores. I tried Harbour's BTree implementation, TokyoCabinet, SQLite and MySQL. All of them slowed down at various stages, and the furthest I could go without an unacceptable drop in performance was about 40 million keys with TokyoCabinet on my laptop with 2 GB of RAM.

Large branching factor = better performance

A binary tree is a tree with two children per node, and has a time complexity of O(log2 N) for a search operation. A b-tree generalizes this idea: each node can have up to M children, which brings the height of the tree, and hence the cost of a search, down to O(logM N). M is called the fanout or branching factor. A large branching factor is the reason why the b-tree is a fast data structure.
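The effect of the branching factor on height is easy to see with a quick calculation. The helper below is an illustrative sketch (not any particular database's code); it uses integer arithmetic rather than floating-point logarithms to avoid rounding surprises at level boundaries:

```python
def tree_height(n_keys, fanout):
    """Number of levels a tree with the given fanout needs to index n_keys."""
    height, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout   # each extra level multiplies capacity by the fanout
        height += 1
    return height

# One million keys: a binary tree needs 20 levels,
# while a b-tree with a branching factor of 100 needs only 3.
print(tree_height(10**6, 2))    # 20
print(tree_height(10**6, 100))  # 3
```

Note also the jump at the boundary: one key past 100^3 = 1,000,000 forces a fourth level, which is exactly the moment a previously in-memory tree can start touching disk.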

When the number of keys (N) crosses a certain bound, the height of the tree will increase and every search/insert might need to traverse one more level. If this level is not in memory, the operation will need a disk access. Since disk access is expensive, this leads to bad performance. Thus,

Efficiency of a B-tree ∝ Available memory / F(N).

F(N) is a function of the number of keys that gives the height of the b-tree.

This leads us to the branching factor: given enough memory, increasing the branching factor makes each level of the tree hold more nodes and thus reduces the b-tree's height. Fewer levels mean fewer disk accesses (if any) and better performance.
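A rough back-of-the-envelope sketch makes the trade-off concrete (the key and pointer sizes here are assumptions for illustration, not measurements): with fanout M, level d holds M^d nodes, so the top levels stay tiny while capacity explodes downward.

```python
# Assumed sizes: 16-byte keys plus 8-byte child pointers,
# so a node with fanout M occupies roughly M * 24 bytes.
FANOUT = 1024
NODE_BYTES = FANOUT * 24

for level in range(3):
    nodes = FANOUT ** level                    # nodes at this level
    megabytes = nodes * NODE_BYTES / 2**20
    print(f"level {level}: {nodes} nodes, ~{megabytes:.1f} MB")
```

With these assumed sizes, the first two levels occupy under 25 MB yet can index over a billion keys (1024^3); keeping a third level resident would need about 24 GB.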

Disproportionately large branching factor = decreased performance

In an ideal world, to cope with larger values of N, we could simply increase the branching factor and individual node capacity to maintain the performance of the B-tree.

But the b-tree structure becomes a burden if the branching factor is so big that the first few levels cannot be held completely in memory. At that point, accessing each level can trigger a page fault, causing disk swapping and very poor performance.
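The arithmetic below shows how this failure mode appears (entry size and fanout are assumptions chosen for illustration): with an enormous fanout, the root still fits in memory, but the very next level already dwarfs any realistic amount of RAM.

```python
# Assumed size: 24 bytes per key + child-pointer entry.
ENTRY_BYTES = 24
RAM_BYTES = 2 * 2**30                 # the 2 GB laptop from earlier

fanout = 10**6                        # a disproportionately large branching factor
root_bytes = fanout * ENTRY_BYTES     # one root node: ~24 MB, fits easily
level2_bytes = fanout * root_bytes    # fanout children of the same size

print(root_bytes / 2**20)             # root level in MB
print(level2_bytes > RAM_BYTES)       # second level alone exceeds RAM
```

The second level here is tens of terabytes, so every search past the root pays a disk access, which is exactly the page-fault behavior described above.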

There are two major concerns here:

1. Limited Memory: The b-tree parameters are limited by the available memory. Since every search starts from the top, the first level of nodes needs to be completely in memory. The more nodes that fit in main memory, the less the need for disk seeks. But scaling physical memory is an expensive proposition and infeasible beyond a point.

Solution: Sharding

Beyond splitting database load, sharding has a huge advantage: sharding a b-tree storage lets us limit the number of keys a single b-tree needs to hold. Each b-tree can thus operate at its optimal size.

Sharding the database along with its b-tree indices across different physical machines is one of the most effective scaling solutions for growing data volumes.
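A minimal sketch of the idea (a hypothetical hash-based router, not any real database's sharding logic): keys are partitioned across several small sorted indexes, each standing in for an independent b-tree on its own machine, so no single index grows past its sweet spot.

```python
import bisect

class ShardedIndex:
    """Route keys across several small sorted indexes by hash.

    Each shard stands in for an independent b-tree on its own machine.
    """
    def __init__(self, n_shards):
        self.shards = [[] for _ in range(n_shards)]

    def _shard(self, key):
        return hash(key) % len(self.shards)

    def insert(self, key):
        bisect.insort(self.shards[self._shard(key)], key)

    def contains(self, key):
        shard = self.shards[self._shard(key)]
        i = bisect.bisect_left(shard, key)
        return i < len(shard) and shard[i] == key

index = ShardedIndex(n_shards=4)
for k in range(1000):
    index.insert(k)
print(index.contains(500))                 # True
print(sum(len(s) for s in index.shards))   # 1000 keys, spread over 4 shards
```

Real systems like MongoDB route by configured shard-key ranges rather than a plain hash, but the effect on each b-tree's size is the same.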

MongoDB uses this approach and, judging from a few instances of MongoDB in production use, seems to be scaling very well.

2. Databases need constant tuning: Changing database system parameters (branching factor, shards, indexes, etc.) usually requires downtime. Scaling-related downtime is a major pain, and we have yet to see a database that tunes itself to growing data volumes.

Solution: Cache-oblivious structures and algorithms

Most implementations of algorithms that deal with large volumes of data take advantage of the caching characteristics of the target machine. In a b-tree, for example, the branching factor is chosen so that a complete level of nodes fits in main memory. Cache-oblivious algorithms remove this dependency on the platform's cache characteristics and implement data structures such as b-trees so that they perform well across diverse memory architectures.
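One classic building block behind cache-oblivious trees is the van Emde Boas layout: the tree is stored so that every recursive subtree is contiguous in memory, which keeps nearby nodes on the same page at every scale of cache, without knowing any cache size in advance. A toy sketch for a perfect binary tree (illustrative only; nodes are identified by their breadth-first index, with the root at 1):

```python
def veb_layout(root, height):
    """Return node indices of a perfect binary tree in van Emde Boas order."""
    if height == 1:
        return [root]
    top_h = height // 2          # height of the top recursive subtree
    bot_h = height - top_h       # height of each bottom recursive subtree
    order = veb_layout(root, top_h)
    # leaves of the top subtree, in breadth-first numbering
    top_leaves = [root * (1 << (top_h - 1)) + i for i in range(1 << (top_h - 1))]
    for leaf in top_leaves:
        order += veb_layout(2 * leaf, bot_h)      # left bottom subtree
        order += veb_layout(2 * leaf + 1, bot_h)  # right bottom subtree
    return order

# Height-3 tree: the root's left and right subtrees land in contiguous runs.
print(veb_layout(1, 3))   # [1, 2, 4, 5, 3, 6, 7]
```

Compare with the usual breadth-first layout [1, 2, 3, 4, 5, 6, 7], where the two subtrees are interleaved across the array and thus across pages.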

Prof. Charles E. Leiserson of MIT heads the Supercomputing technologies group which is doing pioneering research into cache-oblivious algorithms. There are quite a few papers on this topic at the MIT SuperTech website.

The primary authors of the cache-oblivious b-tree papers at CSAIL (Michael A. Bender, Martin Farach-Colton and Bradley Kuszmaul) have founded Tokutek, which built a product based on cache-oblivious b-trees: TokuDB, an add-on storage engine for MySQL that implements fractal trees and is claimed to perform orders of magnitude better than either InnoDB or MyISAM at high data volumes. Though it ships with MySQL, the implementation is not open-source.
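The key idea behind fractal trees can be illustrated with a toy sketch (my own simplification, not TokuDB's code): writes accumulate in a buffer at a node and are flushed downward in bulk, so many inserts share one "disk write" instead of each paying for its own.

```python
class BufferedNode:
    """Toy node that batches inserts in a buffer before pushing them down."""
    def __init__(self, pivot, buffer_size):
        self.pivot = pivot                 # keys < pivot go left, rest go right
        self.buffer_size = buffer_size
        self.buffer = []
        self.left, self.right = [], []     # stand-ins for child subtrees
        self.flushes = 0                   # proxy for disk writes

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.flushes += 1                  # one bulk write moves many keys
        for k in self.buffer:
            (self.left if k < self.pivot else self.right).append(k)
        self.buffer.clear()

node = BufferedNode(pivot=50, buffer_size=8)
for k in range(32):
    node.insert(k)
node.flush()
print(node.flushes)   # 32 inserts cost only 4 bulk flushes, not 32 writes
```

The real structure applies this buffering recursively at every level, which is where the claimed insert-throughput advantage over update-in-place b-trees comes from.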


B-trees are one of the most widely used data structures in computing. However, they are not a silver bullet for every kind of data storage and retrieval problem. Developers still need to go down a few levels of abstraction and tune the database to get the most from it. Sharding is a viable solution, but it requires the application to be tailored to shard the dataset on properly identified key fields. Cache-oblivious b-trees have yet to become widely available in mainstream databases.

It has been about 30 years since the first commercial RDBMS (Ingres) by Michael Stonebraker became popular, and the relational model of database, with the b-tree as its preferred storage structure, has chugged along since. In recent years, however, the demands of realtime web-scale applications have paved the way for more research and changes in the database landscape.