B+ trees
- Expected duration
- 5–8 hours
- Deadline
- 2300 on Lesson 36 (T-day)
- Points
- 75 points
Learning Objectives
- Analyze the performance characteristics of an index
- Manipulate a B+ tree by inserting and deleting data
Help Policy
- Authorized Resources
- References regarding B+ trees except source code
- e.g., the following resources are all authorized:
- pseudocode is authorized but must be documented with the exception of those resources explicitly noted below
-
For the programming portion of this assignment, you may work with a partner who is currently enrolled in this course.
- Notes
- Never copy another person’s work and submit it as your own
-
You must document all help received from all sources, including the instructor and instructor-provided course materials (such as the textbook)
- Exception
- You need not document the B+ tree handout, implementation notes, or excerpt from Database System Concepts
Instructions
This assignment has two parts that are independent of each other. The first part comprises short answer questions that are to be completed individually. The second part requires programming and may be completed with a partner.
Short Answer
Answer the following questions individually:
In the questions that follow, KiB refers to Kibibyte, which is \(2^{10} = 1024\) bytes.
- A balanced binary tree minimizes the number of comparisons required to find
an element. Why do databases use B+ trees (or other types of trees with
higher branching factors such as B-trees) instead of binary trees to index
data?
(5 points)
- Note: For full credit, your response must consider more than simply the asymptotic time complexity of binary and m-ary trees.
- Sketch a proof that inserting a record in a B+ tree requires \(O(\log_b n)\) operations where b is the branching factor and n is the total number of records. (5 points)
- Assume that a B+ tree is being used to index
INTEGER
s, mapping each key to a 64-bit address of the corresponding record. (For example, this index might map user id 27983 to the corresponding user record, one for John Doe.) What branching factor should be chosen to ensure that each node fills a 4 KiB page, assuming that anINTEGER
is 32-bits? (5 points) -
A simplistic query planner works as follows:
The query planner considers two parameters: the cost of a sequential disk access and the cost of a random disk access. Sequential access has unit cost, and a random disk access is 4 times more expensive than sequential access. Thus, sequentially accessing two pages from disk requires 2 time units (i.e., the time to read two pages sequentially from the disk) whereas using an index to find two records on different disk pages requires 8 time units (i.e., the time to read two non-sequential disk pages), assuming that the index is already in memory.
Answer the following questions (note that each question is independent of the others):
- How many tuples fit into a 4 KiB disk page assuming that each tuple requires 36 Bytes of storage? (2 points)
- A table requires 326 pages to store its tuples, and these pages are stored sequentially on disk. What is the estimated cost to retrieve all the tuples from disk? (2 points)
- A table requires 326 pages to store its tuples, and these pages are stored sequentially on disk. In addition, an index exists over these tuples, and this index is stored in memory. What is the estimated cost to retrieve 47 tuples using the index, assuming that no two tuples are stored in the same page? (2 points)
- A table requires 4817 pages to store its tuples, and these pages are stored sequentially on disk. In addition, an index exists over these tuples, and this index is stored in memory. What is the maximum number of tuples that can be retrieved before a sequential scan of the entire table is more efficient than using the index, assuming that no two of the retrieved tuples are stored on the same page? (4 points)
Programming
You are strongly encouraged to complete this portion of the assignment with a partner who is currently enrolled in this course. You may complete it by yourself if you want, but be forewarned that the assignment is likely to take much longer to complete individually.
When accepting the assignment on GitHub Classroom, you will be prompted to create or join a team. You should coordinate with a partner prior to accepting the assignment so that the first partner to accept the assignment creates the team and the second partner joins that team. (Even if working independently, you must still create a team.) Do not join a team without prior coordination, and be sure to join the correct team! GitHub Classroom’s support for group assignments remains limited, which complicates making changes if you make a mistake.
Implement a B+ tree that supports the insert
and delete
operations. (These
operations implicitly require support for find
.) In addition, you must print
the structure and contents of the B+ tree, which will be used for automated
testing (and debugging during development!). Because the invariants of a B+
tree may be satisfied with subtly different semantics (e.g., whether to attempt
to merge or redistribute entries first after deletion), your implementation
must follow the algorithms presented in Database System
Concepts, which are reproduced in the
implementation notes for B+ trees (along with
several corrections for published errata). You should also review the
excerpt from Database System Concepts, which
provides more detail about the various algorithms.
While you are permitted to consult other references regarding B+ trees, others’ algorithms may be subtly different while still maintaining the same invariants. Such differences will cause the structure of the B+ tree to vary, preventing you from passing the automated tests. Following the aforementioned implementation notes should eliminate (or at least minimize) such discrepancies.
Your implementation of a B+ tree will be used by a driver program (e.g., “main”), which will create, manipulate, and print the final state of the B+ tree. More specifically, the driver program will read from standard input the branching factor (or order) of the B+ tree followed by zero or more operations on the B+ tree. The order is an integer n, \(2 \leq n\). An operation is either an insert or a delete and is designated as follows:
- insert (
+ key,value
) - Insert
key
with the specifiedvalue
in the B+ tree - delete (
- key
) - Delete
key
from the B+ tree
When standard input is exhausted, the program should print the final state of the B+ tree, listing each node of the B+ tree with its keys and, for leaves, the value associated with each key. Each node is labeled with a unique identifier that identifies the location of the node where “1” indicates the root, “1.1” the first child of the root, “1.2” the second child of the root, etc. Each node label is followed by a comma-delimited list of that node’s keys in square brackets (the representation of a list in JSON).
The following examples should clarify the format of input and expected output:
- Example (empty tree)
-
4
creates a B+ tree of order 4. No operations are performed on the tree. Consequently, the tree has an empty root node. The output of the program should be as follows:
1: []
which indicates that the tree contains a single node (i.e., the root) that is empty.
- Example (inserts)
-
4 + 1,one + 2,two + 3,three
creates a B+ tree of order 4 and performs the following insertions on that tree:
- key “1” with value “one”
- key “2” with value “two”
- key “3” with value “three”
The output of the program should be as follows:
1: [1, 2, 3] 1: one 2: two 3: three
where the root, which is also a leaf, contains the keys 1, 2, and 3. The value associated with each key appears on the three remaining lines.
- Example (inserts and deletes)
-
8 + 1,one + 2,two + 3,three + 4,four + 5,five + 6,six + 7,seven + 8,eight + 11,eleven + 12,twelve + 13,thirteen + 14,fourteen + 9,nine + 10,ten - 4 - 11
creates a B+ tree of order 8 and performs the specified operations on that tree. Note that the operations are not necessarily ordered by the key being inserted or deleted.
The output of the program should be as follows:
1: [6, 10] 1.1: [1, 2, 3, 5] 1: one 2: two 3: three 5: five 1.2: [6, 7, 8, 9] 6: six 7: seven 8: eight 9: nine 1.3: [10, 12, 13, 14] 10: ten 12: twelve 13: thirteen 14: fourteen
Here, the root node contains two keys (6 and 10) and three children. All the children of the root are leaves. The first leaf (identified by the label 1.1) contains the keys 1, 2, 3, and 5.
To simplify your implementation, you may assume the following:
- All keys are 32-bit integers
- All keys are unique (though the same key may be inserted, deleted, and re-inserted with a different value)
- All values are strings (which are much better for debugging than pointers to records)
- Values may not be unique
- All input is well-formed and follows the previously-described format
- All operations are “valid” – e.g., a key will not be deleted unless it already exists in the tree
You may use any programming language for your implementation. For automated
testing, your program must be compiled without user intervention (e.g., using
make
) or executed by an interpreter without compilation.
Testing
Your program must read from standard input and output a B+ tree to standard output. The prior examples cover the input and output format in greater detail.
Because you are permitted to use any programming language for this assignment,
the test script requires two modifications to execute your code (see the “TODO”
comments in the script). First, if you are using a compiled language, then you
must insert the commands to compile your source code. These commands must
execute without manual intervention by the CI service – e.g., you might
include a Makefile and execute make
to compile your code. Second, you must
provide the path of your executable program (which may have been built in the
prior step). The test script will run this program and compare its output
against the provided test cases.
You are not expected to be an expert with the CI system. You are expected to know how to compile your code outside an IDE or, at a minimum, know which tools your IDE is using and verify that those tools can be executed from the command line. Contact the instructor if you need assistance, particularly if additional packages should be installed to compile your code.
It is recommended that you develop your code inside the virtual machine (VM) created for this course, as the CI service’s environment is much more similar to the VM than a traditional desktop. In addition, you can (and should!) execute the tests locally once you have configured the test script appropriately.
Your program must reproduce the expected output exactly to pass the automated tests.
If you are failing a test case, the first thing to check is the structure of your B+ tree. Does it maintain all the invariants (see handout for a succinct list). If all the invariants hold, look for an off-by-one error when splitting or merging nodes – e.g., an entry may appear in a sibling of the correct node.
Examples
Each of the following examples inserts the keys 2, 3, 5, 7, 11, 17, 19, 23, 29, 31, 9, 10, and 8 and deletes the keys 23 and 19.
- B+ tree of order 4
- Given the following input
4 + 2,two + 3,three + 5,five + 7,seven + 11,eleven + 17,seventeen + 19,nineteen + 23,twenty-three + 29,twenty-nine + 31,thirty-one + 9,nine + 10,ten + 8,eight - 23 - 19
your program should output
1: [11] 1.1: [5, 9] 1.1.1: [2, 3] 2: two 3: three 1.1.2: [5, 7, 8] 5: five 7: seven 8: eight 1.1.3: [9, 10] 9: nine 10: ten 1.2: [19] 1.2.1: [11, 17] 11: eleven 17: seventeen 1.2.2: [29, 31] 29: twenty-nine 31: thirty-one
which represents the following B+ tree:
- B+ tree of order 6
- Given the following input
6 + 2,two + 3,three + 5,five + 7,seven + 11,eleven + 17,seventeen + 19,nineteen + 23,twenty-three + 29,twenty-nine + 31,thirty-one + 9,nine + 10,ten + 8,eight - 23 - 19
your program should output
1: [7, 10] 1.1: [2, 3, 5] 2: two 3: three 5: five 1.2: [7, 8, 9] 7: seven 8: eight 9: nine 1.3: [10, 11, 17, 29, 31] 10: ten 11: eleven 17: seventeen 29: twenty-nine 31: thirty-one
which represents the following B+ tree:
- B+ tree of order 8
- Given the following input
8 + 2,two + 3,three + 5,five + 7,seven + 11,eleven + 17,seventeen + 19,nineteen + 23,twenty-three + 29,twenty-nine + 31,thirty-one + 9,nine + 10,ten + 8,eight - 23 - 19
your program should output
1: [11] 1.1: [2, 3, 5, 7, 8, 9, 10] 2: two 3: three 5: five 7: seven 8: eight 9: nine 10: ten 1.2: [11, 17, 29, 31] 11: eleven 17: seventeen 29: twenty-nine 31: thirty-one
which represents the following B+ tree:
Submission
Complete the Blackboard assessment to submit both parts of this assignment. You must complete the short answer portion individually. Both partners must submit a copy of their source code for the programming portion (i.e., you must submit your source code even if you worked with a partner).
For the programming portion, create an archive of your Git repository (you can use GitHub’s “Clone or download” button when viewing your repository for this purpose) and submit that archive. GitHub Classroom also tags the latest commit at the due date for the assignment.
Grading
The following grading rubric will be used for this assignment:
Description | Points |
---|---|
Short Answer | 25 |
Individual questions | varies |
Programming | 50 |
Automated tests (25 @ 2 points each) | 50 |
The automated tests will be executed automatically when you push to your GitHub repository. Each test case is worth the same amount. No partial credit will be awarded for the automated tests.
Given the potential for an infinite loop, the test harness kills the program when its execution time exceeds a timeout. You should not modify or disable this timeout, as it should be sufficient to pass all the tests.
If your CI jobs block others’ jobs beyond the maximum execution time allowed before automatic termination, you will be penalized \(2^{(n - 1)}\) points where n is the number of times that your jobs stalled the test queue and required manual intervention to resolve. Test locally to avoid this issue!