B+ Trees
A B+ tree is an \(n\)-ary tree with a variable number of children per node. A B+ tree has the following characteristics:
- All data is stored in leaf nodes.
- Every leaf node is at the same level (where the level of a node is 1 + the number of edges between the node and the root).
- The number of children, \(n\), of each internal node except the root is \(\lceil \frac{b}{2} \rceil \leq n \leq b\) where \(b\) is the order (or branching factor) of the tree. An internal node is any node of the tree that has children – i.e., any node that is not a leaf. The root is permitted to have as few as two children.
- All keys are sorted in increasing order.
Typically the order of the B+ tree is very high (e.g., 100s of children per node), which reduces the number of input / output (I/O) operations required to find an element in the tree.
Assume that we have the following relational table:
Name | Revenue (USD millions) | Units Sold (millions) |
---|---|---|
M&M’s | 406.7 | 417.7 |
Reese’s Peanut Butter Cups | 398.9 | 355.2 |
Snickers | 386.2 | 405.3 |
Hershey’s Milk Chocolate Bar | 249.0 | 264.6 |
Kit Kat | 201.8 | 202.5 |
Twix | 169.9 | 153.3 |
3 Musketeers | 100.7 | 102.5 |
Hershey’s Cookies ‘n’ Creme | 80.4 | 80.1 |
Milky Way | 70.2 | 84.0 |
Almond Joy | 60.8 | 62.2 |
… | … | … |
An example B+ tree of order 4 for America’s 10 Favorite Chocolate Candies is as follows:
Notice that the tree preserves all the constraints of a B+ tree:
- The four leaves contain all ten chocolate candies—three entries (3 Musketeers, Almond Joy, and Hershey’s Milk Chocolate Bar) in the first leaf, two in the second leaf, three in the third leaf, and two in the fourth leaf.
- Every leaf is in the second level of the tree. That is, all leaves are the same distance from the root.
- Each internal node has 2–4 children. (The number of children is essentially analogous to the number of pointers being used.) For example, the root has 4 children (the leaves). The leaves are each half full: two leaves contain three keys and the other two leaves contain two keys.
- All the keys are sorted in increasing order. For example, M&M’s appears before Milky Way in the third leaf from the left, and Almond Joy appears to the left of M&M’s.
Each leaf node contains a pointer to the location of the record to provide direct access to the entry (i.e., on-disk location) in the relational table. For example, the record for 3 Musketeers is at the physical address 0x516f0feb. Similarly, the record for Reese’s Peanut Butter Cups is at the address 0x516fa99d. A database management system (DBMS) uses this location to access the record when processing queries. The Structured Query Language (SQL) query
SELECT *
FROM candy
WHERE name = 'Milky Way';
would use the B+ tree to find the entry for Milky Way and retrieve its record, which is stored at location 0x516f84d9.
To facilitate sequential access, all the leaves form a linked list. Notice that the last pointer of the leftmost leaf node references the second leaf node and similarly for the remaining leaves. This design allows a B+ tree to provide efficient random and sequential access to the values.
Because the prior diagram is fairly complex due to the number of pointers, in many cases the pointers in leaves are omitted.
(Click on the diagram to view a larger version.) Notice that the structure of the B+ tree is identical, but this version more clearly indicates each level of the tree, including that all the leaves are at the same level.
Algorithms
Like other types of database indexes, B+ trees support three primary operations:
A high-level overview of each operation follows.
Search
To search a B+ tree, start at the root node and compare the “needle” (i.e., the value being sought) to each key. If the key is larger than the needle, then repeat the search on the child tree referenced by the prior pointer. For example, the following comparisons are made when searching for Milky Way:
- Comparisons within the root node
- Hershey’s Cookies ‘n’ Creme < Milky Way
- M&M’s < Milky Way
- Milky Way < Snickers
- Comparisons within the leaf
- M&M’s < Milky Way
- Milky Way = Milky Way
and at this point, the desired value has been found.
Insert
When inserting new entries, the aforementioned characteristics of the B+ tree must be preserved.
In the simplest case, there is sufficient space in the relevant leaf node to store the value being inserted. For example, inserting Hershey’s Special Dark simply shifts Kit Kat one position to the right in the leaf.
If the leaf has no more space, which would be the case when inserting Nestle Crunch, the leaf must be split into two leaves, each of which contains half the current keys. Such a split requires modifying the parent node, which, in this instance, causes the (current) root to also split and forces the creation of a new root node with two children. The resulting tree has three levels.
Delete
Like insertion, deleting entries requires preserving the characteristics of the B+ tree. First, the key (i.e., the entry to be removed) must be found in a leaf node and removed. If the leaf remains at least half full following the deletion, then additional changes are only necessary if that same key appears in higher levels of the tree, such as deleting M&M’s. If the removal leaves the leaf less than half full, then the values are coalesced with an adjacent leaf and redistributed among the two leaves if the total number of values exceeds the capacity of a single leaf.
Practice
Complete the following exercises from Database System Concepts. Because higher order trees require fewer “rotations” to keep the tree balanced, the B+ tree of order 8 will likely be easier than the B+ tree of order 4. (solutions)
-
Construct a B+ tree for the following set of key values: {2, 3, 5, 7, 11, 17, 19, 23, 29, 31}. Assume that the tree is initially empty and values are added in ascending order. Construct B+ trees for the cases where the number of pointers that will fit in one node [i.e., order] is as follows (Exercise 11.3):
-
Four
-
Six
-
Eight
-
-
For each B+ tree of the prior exercise, show the form of the tree after each of the following series of operations (Exercise 11.4):
- Insert 9
- Insert 10
- Insert 8
- Delete 23
-
Delete 19
-
B+ tree of order 4
-
Insert 9
-
Insert 10
Because the addition of the 10 exceeds the capacity of the second leaf from the left, the leaf is split into two leaves, each of which takes half the keys. The parent node must also be modified to reference both leaves.
-
Insert 8
-
Delete 23
When 23 is deleted, the leaf is no longer half full. Therefore, the remaining key, 19, is combined with the adjacent leaf (of the same parent). The removal of a leaf causes the parent node to have fewer than the required number of children (2 for a B+ tree of order 4). Consequently, it must also be combined with its sibling and the keys redistributed among the two nodes because there are too many keys for a single node.
-
Delete 19
-
-
B+ tree of order 6
-
Insert 9
-
Insert 10
-
Insert 8
-
Delete 23
-
Delete 19
-
-
B+ tree of order 8
-
Insert 9
-
Insert 10
-
Insert 8
-
Delete 23
-
Delete 19
-
-