Project #2 - B+Tree

Do not post your project on a public Github repository.

Overview

The second programming project is to implement an index in your database system. The index is responsible for fast data retrieval without having to search through every row in a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

You will need to implement B+Tree dynamic index structure. It is a balanced tree in which the internal pages direct the search and leaf pages contains actual data entries. Since the tree structure grows and shrink dynamically, you are required to handle the logic of split and merge. The project is comprised of the following tasks and it has two checkpoints.

Before starting this project, make sure you have pulled latest code from the public BusTub repo. You can do this with the following command:

git pull public master

Checkpoint #1 — Due Date: Oct 11 @ 11:59pm

Checkpoint #2 — Due Date: Oct 26 @ 11:59pm

This is a single-person project that will be completed individually (i.e., no groups).

  • Release Date: Sep 29, 2022
  • Due Date: Oct 26, 2022 @ 11:59pm

Project Specification

Like the first project, we are providing you with stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you do this, then it will break the test code that we will use to grade your assignment you end up getting no credit for the project. If a class already contains certain member variables, you should not remove them. But you may add functions/member variables to these classes in order to correctly realize the functionality.

The correctness of B+Tree index depends on the correctness of your implementation of buffer pool, we will not provide solutions for the previous programming projects. Since the first checkpoint is closely related to the second checkpoint in which you will implement index crabbing within existing B+ index, we have passed in a pointer parameter called transaction with default value nullptr. You can safely ignore the parameter for Checkpoint #1; you do not need to change or call any functions relate to the parameter until Task #4.

Checkpoint #1

Task #1 - B+Tree Pages

You need to implement three Page classes to store the data of your B+Tree tree.

B+Tree Parent Page

This is the parent class that both the Internal Page and Leaf Page inherit from. The Parent Page only contains information that both child classes share. The Parent Page is divided into several fields as shown by the table below.

B+Tree Parent Page Content

Variable Name Size Description
page_type_ 4 Page Type (internal or leaf)
lsn_ 4 Log sequence number (Used in Project 4)
size_ 4 Number of Key & Value pairs in page
max_size_ 4 Max number of Key & Value pairs in page
parent_page_id_ 4 Parent Page Id
page_id_ 4 Self Page Id

You must implement your Parent Page in the designated files. You are only allowed to modify the header file (src/include/storage/page/b_plus_tree_page.h) and its corresponding source file (src/storage/page/b_plus_tree_page.cpp).

B+Tree Internal Page

An Internal Page does not store any real data, but instead it stores an ordered m key entries and m+1 child pointers (a.k.a page_id). Since the number of pointers does not equal the number of keys, the first key is set to be invalid, and lookup methods should always start with the second key. At any time, each internal page is at least half full. During deletion, two half-full pages can be joined to make a legal one or can be redistributed to avoid merging, while during insertion one full page can be split into two. This is an example of one of the many design choices that you will make in the implementation of the B+ Tree.

You must implement your Internal Page in the designated files. You are only allowed to modify the header file (src/include/storage/page/b_plus_tree_internal_page.h) and its corresponding source file (src/storage/page/b_plus_tree_internal_page.cpp).

B+Tree Leaf Page

The Leaf Page stores an ordered m key entries and m value entries. In your implementation, value should only be 64-bit record_id that is used to locate where actual tuples are stored, see RID class defined under in src/include/common/rid.h. Leaf pages have the same restriction on the number of key/value pairs as Internal pages, and should follow the same operations of merge, redistribute and split.

You must implement your Internal Page in the designated files. You are only allowed to modify the header file (src/include/storage/page/b_plus_tree_leaf_page.h) and its corresponding source file (src/storage/page/b_plus_tree_leaf_page.cpp).

Important:Even though the Leaf Pages and Internal Pages contain the same type of key, they may have distinct type of value, thus the max_size of leaf and internal pages could be different.

Each B+Tree leaf/internal page corresponds to the content (i.e., the data_ part) of a memory page fetched by buffer pool. So every time you try to read or write a leaf/internal page, you need to first fetch the page from buffer pool using its unique page_id, then reinterpret cast to either a leaf or an internal page, and unpin the page after any writing or reading operations.

Task #2 - B+Tree Data Structure

Your B+Tree Index should only support unique keys. That is to say, when you try to insert a key-value pair with duplicate keys into the index, it should not perform the insertion and return false. Your B+Tree Index must also correctly perform merge or redistribute (called 'coalescing' in the textbook) if deletion cause certain page to go below the occupancy threshold.

For Checkpoint #1, your B+Tree Index is only required to support insertions (Insert()), point search (GetValue()), and deletes (Delete()). You should correctly perform splits if insertion triggers the splitting condition (number of key/value pairs AFTER insertion equals to max_size for leaf nodes, number of children BEFORE insertion equals to max_size for internal nodes.). Since any write operation could lead to the change of root_page_id in B+Tree index, it is your responsibility to update root_page_id in the header page (src/include/storage/page/header_page.h). This is to ensure that the index is durable on disk. Within the BPlusTree class, we have already implemented a function called UpdateRootPageId for you; all you need to do is invoke this function whenever the root_page_id of B+Tree index changes.

Your B+Tree implementation must hide the details of the key/value type and associated comparator, like this:

template <typename KeyType,
          typename ValueType,
          typename KeyComparator>
class BPlusTree{
   // ---
};

These classes are already implemented for you:

  • KeyType: The type of each key in the index. This will only be GenericKey, the actual size of GenericKey is specified and instantiated with a template argument and depends on the data type of indexed attribute.
  • ValueType: The type of each value in the index. This will only be 64-bit RID.
  • KeyComparator: The class used to compare whether two KeyType instances are less/greater-than each other. These will be included in the KeyType implementation files.

Checkpoint #2

Task #3 - Index Iterator

You will build a general purpose index iterator to retrieve all the leaf pages efficiently. The basic idea is to organize them into a single linked list, and then traverse every key/value pairs in specific direction stored within the B+Tree leaf pages. Your index iterator should follow the functionality of Iterator defined in C++17, including the ability to iterate through a range of elements using a set of operators, and for-each loop (with at least the increment, dereference, equal and not-equal operators). Note that in order to support for-each loop function for your index, your BPlusTree should correctly implements begin() and end().

You must implement your index iterator in the designated files. You are only allowed to modify the header file (src/include/storage/index/index_iterator.h) and its corresponding source file (src/index/storage/index_iterator.cpp). You do not need to modify any other files. You must implement the following functions in the IndexIterator class found in these files. In the implementation of index iterator, you are allowed to add any helper methods as long as you have those three methods below.

  • isEnd(): Return whether this iterator is pointing at the last key/value pair.
  • operator++(): Move to the next key/value pair.
  • operator*(): Return the key/value pair this iterator is currently pointing at.
  • operator==(): Return whether two iterators are equal
  • operator!=(): Return whether two iterators are not equal.

Task #4 - Concurrent Index

In this part, you need to update your original single-threaded B+Tree index so that it can support concurrent operations. We will use the latch crabbing technique described in class and in the textbook. The thread traversing the index will acquire then release latches on B+Tree pages. A thread can only release latch on a parent page if its child page considered "safe". Note that the definition of "safe" can vary based on what kind of operation the thread is executing:

  • Search: Starting with root page, grab read (R) latch on child Then release latch on parent as soon as you land on the child page.
  • Insert: Starting with root page, grab write (W) latch on child. Once child is locked, check if it is safe, in this case, not full. If child is safe, release all locks on ancestors.
  • Delete: Starting with root page, grab write (W) latch on child. Once child is locked, check if it is safe, in this case, at least half-full. (NOTE: for root page, we need to check with different standards) If child is safe, release all locks on ancestors.

Important:The write up only describe the basic concepts behind latch crabbing, before you start your implementation, please consult with lecture and textbook Chapter 15.10.

Road Map

There are several ways in which you could go about building a B+Tree Index. This roadmap serves as only a rough conceptual guideline for building one. This roadmap is based on the algorithm outlined in the textbook. You could choose to ignore parts of the roadmap and still end up with a semantically correct B+Tree that passes all our tests. The choice is entirely yours.

  • Simple Inserts: Given a key-value pair KV and a non-full node N, insert KV into N. Self check: What are the different types of nodes and can key-values be inserted in all of them?
  • Tree Traversals: Given a key K, define a traversal mechanism on the tree to determine the presence of the key. Self check: Can keys exist in multiple nodes and are all these keys the same?
  • Simple Splits: Given a key K, and a target leaf node L that is full, insert the key into the tree, while keeping the tree consistent. Self check: When do you choose to split a node and how do define a split?
  • Multiple Splits: Define inserts for a key K on a leaf node L that is full, whose parent node M is also full. Self check: What happens when the parent of M is also full?
  • Simple Deletes: Given a key K and a target leaf node L that is at-least half full, delete K from L. Self check: Is the leaf node L the only node that contains the key K?
  • Simple Coalesces: Define deletes for a key K on a leaf node L that is less than half-full after the delete operation. Self check: Is it mandatory to coalsesce when L is less than half-full and how do you choose which node to coalesce with?
  • Not-So-Simple Coalesces: Define deletes for a key K on a node L that contains no suitable node to coalesce with. Self check: Does coalescing behavior vary depending on the type of nodes? This should take you through to Checkpoint 1
  • Index Iterators The section on Task 3 describes the detailed implementation of an iterator for the B+Tree.
  • Concurrent Indices The section on Task 4 describes the detailed implementation of the latch crabbing technique to introduce concurrency support into your design.

Requirements and Hints

  • You are not allowed to use a global scope latch to protect your data structure. In other words, you may not lock the whole index and only unlock the latch when operations are done. We will check grammatically and manually to make sure you are doing the latch crabbing in the right way.
  • We have provided the implementation of read-write latch (src/include/common/rwlatch.h). And have already added helper functions under page header file to acquire and release latch (src/include/storage/page/page.h).
  • We will not add any mandatory interfaces in the B+Tree index. You can add any function in your implementation as long as you keep all the original public interfaces intact for test purpose.
  • Don't use malloc/new to allocate large blocks of memory for your tree. If you need to need to create a new node for your tree or need a buffer for some operation, you should use the buffer pool.
  • For this task, you have to use the passed in pointer parameter called transaction (src/include/concurrency/transaction.h). It provides methods to store the page on which you have acquired latch while traversing through B+ tree and also methods to store the page which you have deleted during Remove operation. Our suggestion is to look closely at the FindLeafPage method within B+ tree, you may wanna modify your previous implementation (note that you may need to change to return value for this method) and then add the logic of latch crabbing within this particular method.
  • The return value for FetchPage() in buffer pool manager is a pointer that points to a Page instance (src/include/storage/page/page.h). You can grab a latch on Page, but you cannot grab a latch on B+Tree node (neither internal node nor leaf node).

Common Pitfalls

  • You are not tested for thread-safe scans in this project (no concurrent iterator operations will be tested). However, a correct implementation would requires the Leaf Page to throw an std::exception when it cannot acquire a latch on its sibling to avoid potential dead-locks.
  • Think carefully about the order and relationship between UnpinPage(page_id, is_dirty) method from buffer pool manager class and UnLock() methods from page class. You have to release the latch on that page before you unpin the same page from the buffer pool.
  • If you are implementing concurrent B+tree index correctly, then every thread will always acquire latch from root to bottom. When you release the latch, make sure you follow the same order (a.k.a from root to bottom).
  • One of the corner case is that when insert and delete, the member variable root_page_id (src/include/storage/index/b_plus_tree.h) will also be updated. It is your responsibility to protect from concurrent update of this shared variable(hint: add an abstract layer in B+ tree index, you can use std::mutex to protect this variable)

Instructions

See the Project #0 instructions on how to create your private repository and setup your development environment.

Checkpoint #1

Update your project source code like you did in Project #1.

Be sure to include your Project #1 code into the submission.

Replace the following files in the Project #2 repository with your implementations:

  • src/include/buffer/lru_k_replacer.h
  • src/buffer/lru_k_replacer.cpp
  • src/include/buffer/buffer_pool_manager_instance.h
  • src/buffer/buffer_pool_manager_instance.cpp
  • src/include/container/hash/extendible_hash_table.h
  • src/container/hash/extendible_hash_table.h

Make sure you are getting the correct splitting behaviour by checking your solution with b_plus_tree_printer

You are not required to pass the below tests for Checkpoint #1:

  • test/storage/b_plus_tree_concurrent_test.cpp

Checkpoint #2

You will need to pass the below tests for Checkpoint #2:

  • test/storage/b_plus_tree_insert_test.cpp
  • test/storage/b_plus_tree_delete_test.cpp
  • test/storage/b_plus_tree_concurrent_test.cpp

Testing

You can test the individual components of this assigment using our testing framework. We use GTest for unit test cases. Within tools/b_plus_tree_printer is a tool that you can use to print out the internal data structure of your B+ Tree index. You might find it intuitive to track and find early mistakes.

You can compile and run each test individually from the command-line:

$ mkdir build
$ cd build
$ make b_plus_tree_insert_test -j$(nproc)
$ ./test/b_plus_tree_insert_test

You can also run make check-tests to run ALL of the test cases. Note that some tests are disabled as you have not implemented future projects. You can disable tests in GTest by adding a DISABLED_ prefix to the test name.

Important: These tests are only a subset of the all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.

Tree Visualization

Use the b_plus_tree_printer tool to generate a dot file after constructing a tree:

$ # To build the tool
$ mkdir build
$ cd build
$ make b_plus_tree_printer -j$(nproc)
$ ./bin/b_plus_tree_printer
>> ... USAGE ...
>> 5 5 // set leaf node and internal node max size to be 5
>> f input.txt // Insert into the tree with some inserts 
>> g my-tree.dot // output the tree to dot format 
>> q // Quit the test (Or use another terminal) 

You should now have a my-tree.dot file with the DOT file format in the same directory as your test binary, you can then visualize it through either a command line visiualizer or an online visualizer:

  1. Dump the content to http://dreampuf.github.io/GraphvizOnline/.
  2. Or download a command line tool at https://graphviz.org/download/ on your platform.
  3. Draw the tree with dot -Tpng -O my-tree.dot, a PNG file will be generated.

Consider the following example generated with GraphvizOnline:

  • Rectangles in Pink represent internal nodes, those in Green represent leaf nodes.
  • The first row P=3 tells you the page id of the tree page.
  • The second row prints out the properties.
  • The third row prints out keys, and pointers from internal node to the leaf node is constructed from internal node's value.
  • Note that the first box of the internal node is empty. This is not a bug.

You can also compare with our reference solution running in your browser.

Contention Benchmark

To ensure you are implementing lock crabbing correctly, we will use contention benchmark to collect some heuristic from your implementation and manually review your code based on the heuristic. Contention ratio is the slowdown when the B+ tree is used in a multi-thread environment compared with single-thread environment. Generally, the contention ratio should be in the range [2.5, 3.5] on Gradescope. A contention ratio < 2.5 generally means that your lock crabbing is incorrect (e.g, potentially hold some global locks, hold some locks for unnecessary long time), and TAs will manually review your code and apply correctness deduction. A contention ratio > 3.5 means that the lock contention is too high in your implementation and you are recommended to investigating what is happening.

Development Hints

Instead of using printf statements for debugging, use the LOG_* macros for logging information like this:

LOG_DEBUG("Fetching page %d", page_id);

To enable logging in your project, you will need to reconfigure it like this:

cd build
cmake -DCMAKE_BUILD_TYPE=DEBUG ..
make

The different logging levels are defined in src/include/common/logger.h. After enabling logging, the logging level defaults to LOG_LEVEL_INFO. Any logging method with a level that is equal to or higher than LOG_LEVEL_INFO (e.g., LOG_INFO, LOG_WARN, LOG_ERROR) will emit logging information.

Using assert to force check the correctness of your index implementation. For example, when you try to delete a page, the page_count must equals to 0.

Using a relatively small value of page size at the beginning test stage, it would be easier for you to check whether you have done the delete and insert in the correct way. You can change the page size in configuration file (src/include/common/config.h).

Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.

Grading Rubric

Each project submission will be graded based on the following criteria:

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks?

Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.

Late Policy

See the late policy in the syllabus.

Submission

After completing the assignment, you can submit your implementation of to Gradescope. Since we have two checkpoints for this project, you will need to submit them separately through the following link.

You only need to include the following files:

Checkpoint #1

  • Every file for Project 1 (6 in total)
  • src/include/storage/page/b_plus_tree_page.h
  • src/storage/page/b_plus_tree_page.cpp
  • src/include/storage/page/b_plus_tree_internal_page.h
  • src/storage/page/b_plus_tree_internal_page.cpp
  • src/include/storage/page/b_plus_tree_leaf_page.h
  • src/storage/page/b_plus_tree_leaf_page.cpp
  • src/include/storage/index/b_plus_tree.h
  • src/storage/index/b_plus_tree.cpp
  • src/include/storage/index/index_iterator.h
  • src/storage/index/index_iterator.cpp

Checkpoint#2

  • Every file for Project 1 (6 in total)
  • src/include/storage/page/b_plus_tree_page.h
  • src/storage/page/b_plus_tree_page.cpp
  • src/include/storage/page/b_plus_tree_internal_page.h
  • src/storage/page/b_plus_tree_internal_page.cpp
  • src/include/storage/page/b_plus_tree_leaf_page.h
  • src/storage/page/b_plus_tree_leaf_page.cpp
  • src/include/storage/index/b_plus_tree.h
  • src/storage/index/b_plus_tree.cpp
  • src/include/storage/index/index_iterator.h
  • src/storage/index/index_iterator.cpp

Alternatively, running make submit-p2 in your build/ directory will generate a zip archive called project2-submission.zip under your project root directory that you can submit to Gradescope. This command will package the files necessary for both checkpoints. You are not expected to have implemented index iterators and concurrency support for checkpoint 1. You will not be graded on this.

cd build
make submit-p2
cd ..
# project2-submission.zip will be generated here. Submit this.

You can submit your answers as many times as you like and get immediate feedback. Your score will be sent via email to your Andrew account within a few minutes after your submission.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.