Project #2 - Extendible Hash Index
Do not post your project on a public Github repository.
Overview
In this programming project you will implement a disk-backed hash index in your database system. You will be using a variant of extendible hashing as the hashing scheme. Unlike the two-level scheme taught in class, we added a non-resizable header page on top of the directory pages so that the hash table can hold more values and potentially achieve better multi-threaded performance.
The following diagram shows an extendible hash table with a header page of max depth 2, directory pages with max depth 2, and bucket pages holding at most two entries. The values are omitted, and the hashes of the keys are shown in the bucket pages instead of the keys themselves.
The index provides fast data retrieval without needing to search every row in a database table, enabling rapid random lookups. Your implementation should support thread-safe search, insertion, and deletion (including growing/shrinking the directory and splitting/merging buckets).
You must implement the following tasks:
This is a single-person project that must be completed individually (i.e. no groups).
- Release Date: Feb 19, 2024
- Due Date: Mar 15, 2024 @ 11:59pm
Before starting, run `git pull public master` to pull the latest code from the public BusTub repo.
Your work here depends on your implementation of the buffer pool from Project 1. If your Project 1 solution was incorrect, you must fix it to successfully complete this project. We will not provide solutions for the previous programming projects.
Project Specification
Like the first project, we are providing stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you modify the signatures, our grading test code will not work and you will get no credit for the project.
If a class already contains data members, you should not remove them. These are required to implement functionality needed by the rest of the system. Unless specified otherwise, you may add data members and helper functions to these classes to correctly implement the required functionality.
You may use any built-in C++17 containers in your project unless specified otherwise. It is up to you to decide which ones you want to use. Be warned that these containers are not thread-safe; you will need to use latches to protect access to them. You may not use additional third-party libraries (e.g. boost).
Task #1 - Extendible Hash Table Pages
You must implement three Page classes to store the data of your Extendible Hash Table.
Hash Table Header Page
The header page sits at the first level of our disk-based extendible hash table, and there is only one header page per hash table. It stores the logical child pointers to the directory pages (as page ids). You can think of it as a static first-level directory page. The header page has the following fields:
| Variable Name | Size (bytes) | Description |
|---|---|---|
| `directory_page_ids_` | 2048 | An array of directory page ids |
| `max_depth_` | 4 | The maximum depth the header page could handle |
Note that although there is a physical limit on how large a page is, you should use `max_depth_` to determine the upper bound of your `directory_page_ids_` array size.
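To make the layout concrete, here is a minimal sketch of how these two fields map onto the page and how `max_depth_` (rather than the physical array size) bounds the addressable slots. The 512-entry bound (2048 bytes / 4-byte page id) and the `MaxSize()` helper are illustrative assumptions; follow the constants and method names declared in the provided stub.

```cpp
#include <cstdint>

// Sketch only: the field layout implied by the table above.
class ExtendibleHTableHeaderPageSketch {
 public:
  // The number of addressable directory slots is bounded by max_depth_,
  // not by the physical size of the array.
  auto MaxSize() const -> uint32_t { return 1U << max_depth_; }

 private:
  int32_t directory_page_ids_[512];  // 2048 bytes of directory page ids (page_id_t in BusTub)
  uint32_t max_depth_;               // 4 bytes
};
```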
You must implement the extendible hash table header page by modifying only its header file (src/include/storage/page/extendible_htable_header_page.h) and corresponding source file (src/storage/page/extendible_htable_header_page.cpp).
Hash Table Directory Page
Directory pages sit at the second level of our disk-based extendible hash table. Each of them stores the logical child pointers to the bucket pages (as page ids), as well as metadata for handling bucket mapping and dynamic directory growing and shrinking. The directory page has the following fields:
| Variable Name | Size (bytes) | Description |
|---|---|---|
| `max_depth_` | 4 | The maximum depth the directory page could handle |
| `global_depth_` | 4 | The current directory global depth |
| `local_depths_` | 512 | An array of bucket page local depths |
| `bucket_page_ids_` | 2048 | An array of bucket page ids |
Note that although there is a physical limit on how large a page is, you should use `max_depth_` to determine the upper bound of your `bucket_page_ids_` array size.
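As a rough illustration (not the required code), `global_depth_` determines how many of the slots are currently addressable, while `max_depth_` caps how far the directory may ever grow. The field sizes below come from the table above; the helper names are assumptions for the sketch.

```cpp
#include <cstdint>

// Sketch only: depth fields bound the usable portion of the arrays.
class ExtendibleHTableDirectoryPageSketch {
 public:
  // Number of directory slots addressable at the current global depth.
  auto Size() const -> uint32_t { return 1U << global_depth_; }

  // Mask covering the low global_depth_ bits, used for least-significant-bit indexing.
  auto GetGlobalDepthMask() const -> uint32_t { return (1U << global_depth_) - 1; }

 private:
  uint32_t max_depth_;             // 4 bytes
  uint32_t global_depth_;          // 4 bytes
  uint8_t local_depths_[512];      // 512 bytes of per-bucket local depths
  int32_t bucket_page_ids_[512];   // 2048 bytes of bucket page ids (page_id_t in BusTub)
};
```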
You must implement the extendible hash table directory page by modifying only its header file (src/include/storage/page/extendible_htable_directory_page.h) and corresponding source file (src/storage/page/extendible_htable_directory_page.cpp).
Hash Table Bucket Page
Bucket pages sit at the third level of our disk-based extendible hash table. They are the ones that are actually storing the key-value pairs. The bucket page has the following fields:
| Variable Name | Size (bytes) | Description |
|---|---|---|
| `size_` | 4 | The number of key-value pairs the bucket is holding |
| `max_size_` | 4 | The maximum number of key-value pairs the bucket can handle |
| `array_` | less than or equal to 4088 | An array of key-value pairs |
Note that although there is a physical limit on how large a page is, you should use `max_size_` to determine the upper bound on the size of your `array_` of key-value pairs.
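Since the bucket stores raw (key, value) pairs and keys can only be compared through the comparator, a point lookup is just a linear scan. A minimal sketch, assuming the pairs are held in a `std::pair` array and that the comparator returns 0 on equality (as BusTub's comparators do); the function and parameter names are illustrative.

```cpp
#include <cstdint>
#include <utility>

// Linear scan over the first `size` occupied slots; returns true and copies the
// value out if the key is found.
template <typename KeyType, typename ValueType, typename KeyComparator>
auto LookupSketch(const std::pair<KeyType, ValueType> *array, uint32_t size,
                  const KeyType &key, const KeyComparator &cmp, ValueType *value) -> bool {
  for (uint32_t i = 0; i < size; i++) {
    if (cmp(array[i].first, key) == 0) {
      *value = array[i].second;
      return true;
    }
  }
  return false;
}
```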
You must implement the extendible hash table bucket page by modifying only its header file (src/include/storage/page/extendible_htable_bucket_page.h) and corresponding source file (src/storage/page/extendible_htable_bucket_page.cpp).
Each extendible hash table header/directory/bucket page corresponds to the content (i.e., the `data_` part) of a memory page fetched by the buffer pool. Every time you read or write a page, you must first fetch the page from the buffer pool (using its unique `page_id`), reinterpret_cast it to the corresponding type, and unpin the page after reading or writing it. We strongly encourage you to take advantage of the `PageGuard` APIs you implemented in Project #1 to achieve this.
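A minimal sketch of that fetch/cast/unpin cycle, assuming the guard API from the Project #1 skeleton (`FetchPageWrite` returning a `WritePageGuard` with `AsMut<T>()`); if your Project #1 interface differs, adapt the calls accordingly.

```cpp
#include "buffer/buffer_pool_manager.h"
#include "storage/page/extendible_htable_header_page.h"
#include "storage/page/page_guard.h"

namespace bustub {

// Fetch the header page with write access, reinterpret its data_ as the header page
// class, and let the guard unpin the page (and release the latch) on destruction.
void TouchHeaderSketch(BufferPoolManager *bpm, page_id_t header_page_id) {
  WritePageGuard guard = bpm->FetchPageWrite(header_page_id);
  auto *header = guard.AsMut<ExtendibleHTableHeaderPage>();
  (void)header;  // read or modify the header through this pointer
}  // `guard` goes out of scope here: no explicit UnpinPage call is needed

}  // namespace bustub
```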
Task #2 - Extendible Hashing Implementation
Your implementation needs to support insertions, point search, and deletions. There are many helper functions either implemented or documented in the extendible hash table's header and cpp files. Your only strict API requirement is adhering to `Insert`, `GetValue`, and `Remove`. You must also leave the `VerifyIntegrity` function as it is. Please feel free to design and implement additional functions as you see fit.
For this semester, the hash table is intended to support only unique keys. This means that the hash table should return false if the user tries to insert a duplicate key.
Note: You should use the page classes you implemented in Task #1 to store the key-value pairs as well as the metadata to maintain the hash table (page ids, global/local depths). For instance, you should not use an in-memory data structure such as a `std::unordered_map` to mock the hash table.
The extendible hash table is parameterized on arbitrary key, value, and key comparator types.
template <typename KeyType, typename ValueType, typename KeyComparator>
The type parameters are:
- `KeyType`: The type of each key in the index. In practice this will be a `GenericKey`. The actual size of a `GenericKey` varies, and is specified with its own template argument that depends on the type of indexed attribute.
- `ValueType`: The type of each value in the index. In practice, this will be a 64-bit RID.
- `KeyComparator`: A class used to compare whether two `KeyType` instances are less than, greater than, or equal to each other. These will be included in the `KeyType` implementation files.
Note: Our hash table functions also take a `Transaction*` with a default value of `nullptr`. This is intended for Project 4, in case you want to implement concurrent index lookup as part of concurrency control. You generally do not need to use it in this project.
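Purely for illustration, a table over 8-byte generic keys and RID values could be declared as below. The exact constructor arguments come from the provided stub class, so treat this as a sketch of the template parameters rather than the required signature.

```cpp
// Illustrative only: the three template parameters filled in with the types the
// provided tests typically use. Check the stub header for the real constructor.
using HashTable8 = DiskExtendibleHashTable<GenericKey<8>, RID, GenericComparator<8>>;

// Usage of the three required methods (Transaction* defaults to nullptr; see the
// stub for the exact parameter lists):
//   ht.Insert(key, rid);     // returns false on a duplicate key
//   ht.GetValue(key, ...);   // point lookup
//   ht.Remove(key);
```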
You must implement the extendible hash table by modifying only its header file (src/include/container/disk/hash/disk_extendible_hash_table.h) and corresponding source file (src/container/disk/hash/disk_extendible_hash_table.cpp).
This project requires you to implement bucket splitting/merging and directory growing/shrinking. The following subsections provide a specification on the implementation details.
Empty Table
When you first create an empty hash table, it should only have the (one and only) header page. Directory pages and bucket pages should be created on demand.
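A sketch of what the constructor might do, assuming the Project #1 `NewPageGuarded` API and the header page's `Init` helper (both names may differ in your code); directory slots start as `INVALID_PAGE_ID` so that directory and bucket pages are created lazily by the first insert that needs them.

```cpp
// Inside the DiskExtendibleHashTable constructor (sketch, names are assumptions):
page_id_t header_page_id = INVALID_PAGE_ID;
auto header_guard = bpm_->NewPageGuarded(&header_page_id);
auto *header = header_guard.AsMut<ExtendibleHTableHeaderPage>();
header->Init(header_max_depth_);
header_page_id_ = header_page_id;
// No directory or bucket pages yet: every directory_page_ids_ slot stays
// INVALID_PAGE_ID until an insert needs that directory.
```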
Header Indexing
You will want to use the most-significant bits for indexing into the header page's `directory_page_ids_` array. This involves taking the hash of your key and performing bit operations with the depth of the header page. The header page depth will not change.
Directory Indexing
You will want to use the least-significant bits for indexing into the directory page's `bucket_page_ids_` array. This involves taking the hash of your key and performing bit operations with the current depth of the directory page.
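A matching sketch for the directory level masks the hash down to the current global depth; with a global depth of 0 the mask is 0 and every key maps to slot 0.

```cpp
#include <cstdint>

// Map a 32-bit hash to a directory slot using its least-significant global_depth bits.
auto HashToBucketIndexSketch(uint32_t hash, uint32_t global_depth) -> uint32_t {
  return hash & ((1U << global_depth) - 1);
}
```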
Bucket Splitting
You must split a bucket if there is no room for an insertion. Ensure that you support recursive splitting, for the case where the new bucket after a split still does not have enough space, unless max_depth_ is reached. You can ostensibly split as soon as the bucket becomes full, if you find that easier. However, the reference solution splits only when an insertion would overflow a page; hence, you may find that the provided API is more amenable to this approach. As always, you are welcome to factor out your own internal API.
Bucket Merging
Merging must be attempted when a bucket becomes empty. There are ways to merge more aggressively by checking the occupancy of buckets and their split images, but these expensive checks and extra merges can increase thrashing.
To keep things relatively simple, we provide the following rules for merging:
- Only empty buckets can be merged.
- Buckets can only be merged with their split image if their split image has the same local depth.
- You should keep merging recursively if the new split image of the merged bucket is empty.
If you are confused about what a "split image" is, please review the algorithm and the code documentation. The concept falls out quite naturally.
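Concretely, for a bucket with local depth d > 0 reached through directory slot `bucket_idx`, its split image is the slot with bit (d-1) flipped: the two buckets agree on the low d-1 bits and differ only in that bit. A one-line sketch (the stub documents its own helper for this):

```cpp
#include <cstdint>

// Split image of `bucket_idx` at local depth `local_depth` (> 0).
auto SplitImageIndexSketch(uint32_t bucket_idx, uint32_t local_depth) -> uint32_t {
  return bucket_idx ^ (1U << (local_depth - 1));
}
```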
Directory Growing
There are no fancy rules for this part of the hash table. You either have to grow the directory, or you do not.
Directory Shrinking
Only shrink the directory if the local depth of every bucket is strictly less than the global depth of the directory.
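In other words, the directory can halve only when no slot's local depth equals the global depth. A minimal sketch of that check, iterating over the currently addressable slots:

```cpp
#include <cstdint>

// True if every bucket's local depth is strictly less than the global depth,
// i.e. the highest directory bit is currently redundant for every bucket.
auto CanShrinkSketch(const uint8_t *local_depths, uint32_t global_depth) -> bool {
  for (uint32_t i = 0; i < (1U << global_depth); i++) {
    if (local_depths[i] >= global_depth) {
      return false;
    }
  }
  return true;
}
```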
Task #3 - Concurrency Control
Finally, modify your extendible hash table implementation so that it safely supports concurrent operations with multiple threads. The thread traversing the index should acquire latches on hash table pages as necessary to ensure safe concurrent operations, and should release latches on parent pages as soon as it is possible to determine that it is safe to do so.
We recommend that you complete this task by using the `FetchPageWrite` or `FetchPageRead` buffer pool API, depending on whether you want to access a page with read or write privileges. Then modify your implementation to grab and release read and write latches as necessary to implement the latch crabbing algorithm.
Note: You should never acquire the same read lock twice in a single thread. It might lead to deadlock.
Note: You should make careful design decisions on latching. Always holding a global latch on the entire hash table is probably not a good idea. The TAs will manually review your implementation, and a poor latching design will result in a deduction of points.
Leaderboard Task (Optional)
For this project's leaderboard challenge, we will benchmark your hash table with a very large buffer pool.
Optimizing for the leaderboard is optional (i.e., you can get a perfect score in the project after finishing all previous tasks). However, your solution must finish the leaderboard test with a correct result and without deadlocks or segfaults.
The leaderboard test is compiled with the release profile:
```
mkdir cmake-build-relwithdebinfo
cd cmake-build-relwithdebinfo
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j`nproc` htable-bench
./bin/bustub-htable-bench
```
We strongly recommend that you checkpoint your code before optimizing for the leaderboard tests, so that if these optimizations cause problems in future projects, you can always revert them.
In the leaderboard test, we will have multiple threads accessing the hash table. There are two types of threads running in the benchmark:
- Read threads. Each thread will get values for a range of keys.
- Write threads. Each thread will insert and remove keys from the hash table.
The final score is computed as a geometric average of read and write qps (throughput).
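Concretely, if the benchmark reports one aggregate read throughput and one aggregate write throughput (an assumption about how the numbers are aggregated), the geometric average works out to:

$$\text{score} = \sqrt{\text{QPS}_{\text{read}} \times \text{QPS}_{\text{write}}}$$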
Recommended Optimizations
- Optimistic latching. For instance, during insertion, instead of taking the write lock on the header page before determining whether a new directory needs to be added, you can optimistically take the read lock on the header page, check whether a new directory needs to be added, and only obtain the write lock on the header page if you need to add a new directory. The same applies to locking directory pages; see the sketch after this list.
- Bloom filters stored in the header and/or directory pages. You are not allowed to create a separate page to store a bloom filter.
  - If your bloom filter functions need input keys of type `GenericKey`, which comes in variable sizes, template functions may be useful. For reference on the use of `template`, see src/storage/page/extendible_htable_bucket_page.cpp.
  - To derive your k hash functions, please refer to third_party/murmur3, the only supported third-party hashing library.
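A sketch of the optimistic pattern for the header described in the first bullet; the key subtlety is that after dropping the read latch and taking a write latch, the check must be repeated, because another thread may have created the directory in between. Guard API and helper names are the same assumptions as in the earlier sketches.

```cpp
// Optimistic path (sketch): read-latch the header first, and only take a write
// latch if a new directory actually has to be created.
uint32_t hash = Hash(key);  // the table's own hash helper
ReadPageGuard header_read = bpm_->FetchPageRead(header_page_id_);
auto *header = header_read.As<ExtendibleHTableHeaderPage>();
uint32_t directory_idx = header->HashToDirectoryIndex(hash);
auto directory_page_id = header->GetDirectoryPageId(directory_idx);
header_read.Drop();

if (directory_page_id == INVALID_PAGE_ID) {
  // Pessimistic path: write-latch the header and re-check, since another thread may
  // have created the directory between Drop() and FetchPageWrite().
  WritePageGuard header_write = bpm_->FetchPageWrite(header_page_id_);
  auto *mutable_header = header_write.AsMut<ExtendibleHTableHeaderPage>();
  if (mutable_header->GetDirectoryPageId(directory_idx) == INVALID_PAGE_ID) {
    // ... create the directory page and record its id in the header ...
  }
}
```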
Note: We recommend starting with optimistic latching, as this will likely yield the largest improvement in performance. You might find it difficult to achieve good performance by implementing both optimistic latching and bloom filters completely, so you may choose to stick with one approach or experiment with a mix of both.
Leaderboard Policy
- Submissions with leaderboard bonus are subject to manual review by TAs.
- By "review", we mean that the TAs will manually look into your code; if they are unsure whether an optimization is correct just by looking, they will make simple modifications to existing test cases to see whether your leaderboard optimization correctly handles the specific cases you want to optimize.
  - One example of a simple modification: changing the buffer pool manager size for the benchmark.
- Your optimization should not affect correctness. You can optimize for specific cases, but your implementation must still work for all inputs in the cases you optimize.
- You should not try to detect whether your submission is running the leaderboard test by using side information.
  - Unless we explicitly allow you to do so.
  - Disallowed: using `#ifdef NDEBUG`, etc.
- Submissions with obvious correctness issues will not be assigned leaderboard bonus.
- You cannot use late days for leaderboard tests.
- If you are unsure about whether an optimization is reasonable, you should post on Piazza or visit any TA's office hour.
Instructions
See the Project #0 instructions on how to create your private repository and setup your development environment.
Testing
You can test the individual components of this assignment using our testing framework. We use GTest for unit test cases. The following files contain tests for each component:
- Hash Table Pages: test/storage/extendible_htable_page_test.cpp
- Extendible Hash Table: test/container/disk/hash/extendible_htable_test.cpp
You can compile and run each test individually from the command-line:
```
$ make extendible_htable_page_test -j$(nproc)
$ ./test/extendible_htable_page_test
```
You can also run `make check-tests` to run ALL of the test cases. Note that some tests are disabled because you have not implemented future projects. You can disable tests in GTest by adding a `DISABLED_` prefix to the test name.
Important: These tests are only a subset of all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.
Formatting
Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of your source code. Your project grade will be zero if your submission fails any of these checks.
Execute the following commands to check your syntax. The `format` target will automatically correct your code. The `check-lint` and `check-clang-tidy-p2` targets will print errors and instruct you on how to fix them to conform to our style guide.
```
$ make format
$ make check-clang-tidy-p2
```
Memory Leaks
For this project, we use LLVM Address Sanitizer (ASAN) and Leak Sanitizer (LSAN) to check for memory errors. To enable ASAN and LSAN, configure CMake in debug mode and run tests as you normally would. If there is a memory error, you will see a memory error report. Note that macOS only supports the address sanitizer without the leak sanitizer.
In some cases, address sanitizer might affect the usability of the debugger. In this case, you might need to disable all sanitizers by configuring the CMake project with:
$ cmake -DCMAKE_BUILD_TYPE=Debug -DBUSTUB_SANITIZER= ..
Development Hints
You can use `BUSTUB_ASSERT` for assertions in debug mode. Note that the statements within `BUSTUB_ASSERT` will NOT be executed in release mode. If you have something to assert in all cases, use `BUSTUB_ENSURE` instead.
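For example (a sketch: the conditions and message strings are arbitrary, and the two-argument (condition, message) form shown here is the one used throughout the BusTub codebase):

```cpp
// Checked only in debug builds: compiled out in release mode, so the condition
// must not have side effects you rely on.
BUSTUB_ASSERT(bucket_idx < directory->Size(), "bucket index out of range");

// Checked in all build modes: use this for invariants that must always hold.
BUSTUB_ENSURE(directory_page_id != INVALID_PAGE_ID, "directory page must exist");
```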
Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.
We encourage you to use a graphical debugger to debug your project if you are having problems.
If you are having compilation problems, note that running `make clean` does not completely reset the compilation process. You will need to delete your build directory and run `cmake ..` again before you rerun `make`.
Grading Rubric
Each project submission will be graded based on the following criteria:
- Does the submission successfully execute all of the test cases and produce the correct answer?
- Does the submission execute without any memory leaks?
- Does the submission follow the code formatting and style policies?
Note that we will use additional test cases to grade your submission that are more complex than the sample test cases that we provide you.
Late Policy
See the late policy in the syllabus.
Submission
After completing the assignment, you can submit your implementation to Gradescope:
Running `make submit-p2` in your build/ directory will generate a zip archive called project2-submission.zip under your project root directory that you can submit to Gradescope.
You can submit your answers as many times as you like and get immediate feedback.
Notes on Gradescope and Autograder
- If you are timing out on Gradescope, it is likely because you have a deadlock in your code or your code is too slow and does not run in 60 seconds. If your code is too slow, it may be because of a performance issue in the buffer pool manager you implemented in Project 1.
- The autograder will not work if you are printing too many logs in your submissions.
- If the autograder did not work properly, make sure that your formatting commands work and that you are submitting the right files.
CMU students should use the Gradescope course code announced on Piazza.
Collaboration Policy
- Every student has to work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.