Expected duration
5–8 hours
Deadline
2300 on Lesson 36 (T-day)
Points
75 points

Learning Objectives

  • Analyze the performance characteristics of an index
  • Manipulate a B+ tree by inserting and deleting data

Help Policy

Authorized Resources
References regarding B+ trees except source code

For the programming portion of this assignment, you may work with a partner who is currently enrolled in this course.

Notes
Never copy another person’s work and submit it as your own

You must document all help received from all sources, including the instructor and instructor-provided course materials (such as the textbook)

Exception
You need not document the B+ tree handout, implementation notes, or excerpt from Database System Concepts

Instructions

This assignment has two parts that are independent of each other. The first part comprises short answer questions that are to be completed individually. The second part requires programming and may be completed with a partner.

Short Answer

Answer the following questions individually:

In the questions that follow, KiB refers to Kibibyte, which is \(2^{10} = 1024\) bytes.

  1. A balanced binary tree minimizes the number of comparisons required to find an element. Why do databases use B+ trees (or other types of trees with higher branching factors such as B-trees) instead of binary trees to index data? (5 points)
    • Note: For full credit, your response must consider more than simply the asymptotic time complexity of binary and m-ary trees.
  2. Sketch a proof that inserting a record in a B+ tree requires \(O(\log_b n)\) operations where b is the branching factor and n is the total number of records. (5 points)
  3. Assume that a B+ tree is being used to index INTEGERs, mapping each key to a 64-bit address of the corresponding record. (For example, this index might map user id 27983 to the corresponding user record, one for John Doe.) What branching factor should be chosen to ensure that each node fills a 4 KiB page, assuming that an INTEGER is 32-bits? (5 points)
  4. A simplistic query planner works as follows:

    The query planner considers two parameters: the cost of a sequential disk access and the cost of a random disk access. Sequential access has unit cost, and a random disk access is 4 times more expensive than sequential access. Thus, sequentially accessing two pages from disk requires 2 time units (i.e., the time to read two pages sequentially from the disk) whereas using an index to find two records on different disk pages requires 8 time units (i.e., the time to read two non-sequential disk pages), assuming that the index is already in memory.

    Answer the following questions (note that each question is independent of the others):

    1. How many tuples fit into a 4 KiB disk page assuming that each tuple requires 36 Bytes of storage? (2 points)
    2. A table requires 326 pages to store its tuples, and these pages are stored sequentially on disk. What is the estimated cost to retrieve all the tuples from disk? (2 points)
    3. A table requires 326 pages to store its tuples, and these pages are stored sequentially on disk. In addition, an index exists over these tuples, and this index is stored in memory. What is the estimated cost to retrieve 47 tuples using the index, assuming that no two tuples are stored in the same page? (2 points)
    4. A table requires 4817 pages to store its tuples, and these pages are stored sequentially on disk. In addition, an index exists over these tuples, and this index is stored in memory. What is the maximum number of tuples that can be retrieved before a sequential scan of the entire table is more efficient than using the index, assuming that no two of the retrieved tuples are stored on the same page? (4 points)

Programming

You are strongly encouraged to complete this portion of the assignment with a partner who is currently enrolled in this course. You may complete it by yourself if you want, but be forewarned that the assignment is likely to take much longer to complete individually.

When accepting the assignment on GitHub Classroom, you will be prompted to create or join a team. You should coordinate with a partner prior to accepting the assignment so that the first partner to accept the assignment creates the team and the second partner joins that team. (Even if working independently, you must still create a team.) Do not join a team without prior coordination, and be sure to join the correct team! GitHub Classroom’s support for group assignments remains limited, which complicates making changes if you make a mistake.

Implement a B+ tree that supports the insert and delete operations. (These operations implicitly require support for find.) In addition, you must print the structure and contents of the B+ tree, which will be used for automated testing (and debugging during development!). Because the invariants of a B+ tree may be satisfied with subtly different semantics (e.g., whether to attempt to merge or redistribute entries first after deletion), your implementation must follow the algorithms presented in Database System Concepts, which are reproduced in the implementation notes for B+ trees (along with several corrections for published errata). You should also review the excerpt from Database System Concepts, which provides more detail about the various algorithms.

While you are permitted to consult other references regarding B+ trees, others’ algorithms may be subtly different while still maintaining the same invariants. Such differences will cause the structure of the B+ tree to vary, preventing you from passing the automated tests. Following the aforementioned implementation notes should eliminate (or at least minimize) such discrepancies.

Your implementation of a B+ tree will be used by a driver program (e.g., “main”), which will create, manipulate, and print the final state of the B+ tree. More specifically, the driver program will read from standard input the branching factor (or order) of the B+ tree followed by zero or more operations on the B+ tree. The order is an integer n, \(2 \leq n\). An operation is either an insert or a delete and is designated as follows:

insert (+ key,value)
Insert key with the specified value in the B+ tree
delete (- key)
Delete key from the B+ tree

When standard input is exhausted, the program should print the final state of the B+ tree, listing each node of the B+ tree with its keys and, for leaves, the value associated with each key. Each node is labeled with a unique identifier that identifies the location of the node where “1” indicates the root, “1.1” the first child of the root, “1.2” the second child of the root, etc. Each node label is followed by a comma-delimited list of that node’s keys in square brackets (the representation of a list in JSON).

The following examples should clarify the format of input and expected output:

Example (empty tree)
4

creates a B+ tree of order 4. No operations are performed on the tree. Consequently, the tree has an empty root node. The output of the program should be as follows:

1: []

which indicates that the tree contains a single node (i.e., the root) that is empty.

Example (inserts)
4
+ 1,one
+ 2,two
+ 3,three

creates a B+ tree of order 4 and performs the following insertions on that tree:

  • key “1” with value “one”
  • key “2” with value “two”
  • key “3” with value “three”

The output of the program should be as follows:

1: [1, 2, 3]
  1: one
  2: two
  3: three

where the root, which is also a leaf, contains the keys 1, 2, and 3. The value associated with each key appears on the three remaining lines.

Example (inserts and deletes)
8
+ 1,one
+ 2,two
+ 3,three
+ 4,four
+ 5,five
+ 6,six
+ 7,seven
+ 8,eight
+ 11,eleven
+ 12,twelve
+ 13,thirteen
+ 14,fourteen
+ 9,nine
+ 10,ten
- 4
- 11

creates a B+ tree of order 8 and performs the specified operations on that tree. Note that the operations are not necessarily ordered by the key being inserted or deleted.

The output of the program should be as follows:

1: [6, 10]
  1.1: [1, 2, 3, 5]
    1: one
    2: two
    3: three
    5: five
  1.2: [6, 7, 8, 9]
    6: six
    7: seven
    8: eight
    9: nine
  1.3: [10, 12, 13, 14]
    10: ten
    12: twelve
    13: thirteen
    14: fourteen

Here, the root node contains two keys (6 and 10) and three children. All the children of the root are leaves. The first leaf (identified by the label 1.1) contains the keys 1, 2, 3, and 5.

To simplify your implementation, you may assume the following:

  • All keys are 32-bit integers
  • All keys are unique (though the same key may be inserted, deleted, and re-inserted with a different value)
  • All values are strings (which are much better for debugging than pointers to records)
  • Values may not be unique
  • All input is well-formed and follows the previously-described format
  • All operations are “valid” – e.g., a key will not be deleted unless it already exists in the tree

You may use any programming language for your implementation. For automated testing, your program must be compiled without user intervention (e.g., using make) or executed by an interpreter without compilation.

Testing

Your program must read from standard input and output a B+ tree to standard output. The prior examples cover the input and output format in greater detail.

Because you are permitted to use any programming language for this assignment, the test script requires two modifications to execute your code (see the “TODO” comments in the script). First, if you are using a compiled language, then you must insert the commands to compile your source code. These commands must execute without manual intervention by the CI service – e.g., you might include a Makefile and execute make to compile your code. Second, you must provide the path of your executable program (which may have been built in the prior step). The test script will run this program and compare its output against the provided test cases.

You are not expected to be an expert with the CI system. You are expected to know how to compile your code outside an IDE or, at a minimum, know which tools your IDE is using and verify that those tools can be executed from the command line. Contact the instructor if you need assistance, particularly if additional packages should be installed to compile your code.

It is recommended that you develop your code inside the virtual machine (VM) created for this course, as the CI service’s environment is much more similar to the VM than a traditional desktop. In addition, you can (and should!) execute the tests locally once you have configured the test script appropriately.

Your program must reproduce the expected output exactly to pass the automated tests.

If you are failing a test case, the first thing to check is the structure of your B+ tree. Does it maintain all the invariants (see handout for a succinct list). If all the invariants hold, look for an off-by-one error when splitting or merging nodes – e.g., an entry may appear in a sibling of the correct node.

Examples

Each of the following examples inserts the keys 2, 3, 5, 7, 11, 17, 19, 23, 29, 31, 9, 10, and 8 and deletes the keys 23 and 19.

B+ tree of order 4
Given the following input
4
+ 2,two
+ 3,three
+ 5,five
+ 7,seven
+ 11,eleven
+ 17,seventeen
+ 19,nineteen
+ 23,twenty-three
+ 29,twenty-nine
+ 31,thirty-one
+ 9,nine
+ 10,ten
+ 8,eight
- 23
- 19

your program should output

1: [11]
  1.1: [5, 9]
    1.1.1: [2, 3]
      2: two
      3: three
    1.1.2: [5, 7, 8]
      5: five
      7: seven
      8: eight
    1.1.3: [9, 10]
      9: nine
      10: ten
  1.2: [19]
    1.2.1: [11, 17]
      11: eleven
      17: seventeen
    1.2.2: [29, 31]
      29: twenty-nine
      31: thirty-one

which represents the following B+ tree:

illustration of B+ tree

B+ tree of order 6
Given the following input
6
+ 2,two
+ 3,three
+ 5,five
+ 7,seven
+ 11,eleven
+ 17,seventeen
+ 19,nineteen
+ 23,twenty-three
+ 29,twenty-nine
+ 31,thirty-one
+ 9,nine
+ 10,ten
+ 8,eight
- 23
- 19

your program should output

1: [7, 10]
  1.1: [2, 3, 5]
    2: two
    3: three
    5: five
  1.2: [7, 8, 9]
    7: seven
    8: eight
    9: nine
  1.3: [10, 11, 17, 29, 31]
    10: ten
    11: eleven
    17: seventeen
    29: twenty-nine
    31: thirty-one

which represents the following B+ tree:

illustration of B+ tree

B+ tree of order 8
Given the following input
8
+ 2,two
+ 3,three
+ 5,five
+ 7,seven
+ 11,eleven
+ 17,seventeen
+ 19,nineteen
+ 23,twenty-three
+ 29,twenty-nine
+ 31,thirty-one
+ 9,nine
+ 10,ten
+ 8,eight
- 23
- 19

your program should output

1: [11]
  1.1: [2, 3, 5, 7, 8, 9, 10]
    2: two
    3: three
    5: five
    7: seven
    8: eight
    9: nine
    10: ten
  1.2: [11, 17, 29, 31]
    11: eleven
    17: seventeen
    29: twenty-nine
    31: thirty-one

which represents the following B+ tree:

illustration of B+ tree

Submission

Complete the Blackboard assessment to submit both parts of this assignment. You must complete the short answer portion individually. Both partners must submit a copy of their source code for the programming portion (i.e., you must submit your source code even if you worked with a partner).

For the programming portion, create an archive of your Git repository (you can use GitHub’s “Clone or download” button when viewing your repository for this purpose) and submit that archive. GitHub Classroom also tags the latest commit at the due date for the assignment.

Grading

The following grading rubric will be used for this assignment:

Description Points
Short Answer 25
Individual questions varies
Programming 50
Automated tests (25 @ 2 points each) 50

The automated tests will be executed automatically when you push to your GitHub repository. Each test case is worth the same amount. No partial credit will be awarded for the automated tests.

Given the potential for an infinite loop, the test harness kills the program when its execution time exceeds a timeout. You should not modify or disable this timeout, as it should be sufficient to pass all the tests.

If your CI jobs block others’ jobs beyond the maximum execution time allowed before automatic termination, you will be penalized \(2^{(n - 1)}\) points where n is the number of times that your jobs stalled the test queue and required manual intervention to resolve. Test locally to avoid this issue!