Project #1 - Buffer Pool
Do not post your project on a public Github repository.
During the semester, you will be building a new disk-oriented storage manager for the BusTub DBMS. Such a storage manager assumes that the primary storage location of the database is on disk.
The first programming project is to implement a buffer pool in your storage manager. The buffer pool is responsible for moving physical pages back and forth from main memory to disk. It allows a DBMS to support databases that are larger than the amount of memory that is available to the system. The buffer pool's operations are transparent to other parts in the system. For example, the system asks the buffer pool for a page using its unique identifier (
page_id_t) and it does not know whether that page is already in memory or whether the system has to go retrieve it from disk.
Your implementation will need to be thread-safe. Multiple threads will be accessing the internal data structures at the same and thus you need to make sure that their critical sections are protected with latches (these are called "locks" in operating systems).
You will need to implement the following three components in your storage manager:
This is a single-person project that will be completed individually (i.e. no groups).
- Release Date: Sep 13, 2022
- Due Date: Oct 02, 2022 @ 11:59pm
For each of the following components, we are providing you with stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you modify the signatures, the test code that we use for grading will break and you will get no credit for the project.
If a class already contains data members, you should not remove them. For example, the
BufferPoolManagerInstance contains the
LRUKReplacer objects. These are required to implement the functionality that is needed by the rest of the system. On the other hand, you may need to add data members to these classes in order to correctly implement the required functionality. You can also add additional helper functions to these classes. The choice is yours.
You are allowed to use any built-in C++17 containers in your project unless specified otherwise. It is up to you to decide which ones you want to use. Note that these containers are not thread-safe and that you will need to include latches in your implementation to protect them. You may not bring in additional third-party dependencies (e.g. boost).
Task #1 - Extendible Hash Table
For the first part of this project, you will build a general purpose hash table that uses unordered buckets to store unique key/value pairs. Your hash table must support the ability to insert/delete key/value entries without specifying the max size of the table. Your table needs to incrementally grow in size as needed but you do not need shrink it. That is, you do not need to implement support for shrinking or compacting the hash table. You will also need to support checking to see whether a key exists in the hash table and return its corresponding value.
We encourage you to walk through some small examples of split and directory growing cases by hand first before starting implementing.
You must implement your hash table in the designated files in the project source code. You are only allowed to modify the hash table header file (src/include/container/hash/extendible_hash_table.h) and its corresponding implementation file (src/container/hash/extendible_hash_table.cpp). You do not need to modify any other files. You may not use another built-in hash table internally in your implementation. You must implement the following functions in the
Find(K, V): For the given key K, check to see whether it exists in the hash table. If it does, then store the pointer to its corresponding value in V and return true. If the key does not exist, then return false.
Insert(K, V): Insert the key/value pair into the hash table. If the key K already exists, overwrite its value with the new value V and return true. If the key/value pair couldn't be inserted into the bucket (because the bucket is full and the key is not updating an existing pair), do the following steps before retrying:
- If the local depth of the bucket is equal to the global depth, increment the global depth and double the size of the directory.
- Increment the local depth of the bucket.
- Split the bucket and redistribute directory pointers & the kv pairs in the bucket.
Remove(K): For the given key K, remove its corresponding key/value pair from the hash table and return true. If the key K does not exist in the hash table, then return false.
GetGlobalDepth(): Return the current global depth of the entire hash table.
GetLocalDepth(dir_index): Return the current local depth for the bucket which the given directory index points to.
GetNumBuckets(): Return the total number of buckets allocated in the hash table.
You can make use of the provided
IndexOf(K) private function to compute the the directory index that a given key hashes to. In addition, we provide a
Bucket nested class that represents buckets in the extendible hash table. Finishing the TODOs of the
Bucket class first by following the code documentation can make it easier for you to implement the
ExtendibleHashTable APIs. But you can feel free to write your own internal class / helper functions.
You need to make sure that all operations in the hash table are thread-safe using
std::mutex. It is up to you to decide how you want to protect the data structure.
Task #2 - LRU-K Replacement Policy
This component is responsible for tracking page usage in the buffer pool. You will implement a new class called
LRUKReplacer in src/include/buffer/lru_k_replacer.h and its corresponding implementation file in src/buffer/lru_k_replacer.cpp. Note that
LRUKReplacer is a stand-alone class and it is not related to any of the other
Replacer classes. You are expected to implement only the LRU-K replacement policy. You don't have to implement LRU or clock replacement policy, even if there is a corresponding file for it.
The LRU-K algorithm evicts a frame whose backward k-distance is maximum of all frames in the replacer. Backward k-distance is computed as the difference in time between current timestamp and the timestamp of kth previous access. A frame with less than k historical accesses is given +inf as its backward k-distance. When multipe frames have +inf backward k-distance, the replacer evicts the frame with the earliest timestamp.
The maximum size for the
LRUKReplacer is the same as the size of the buffer pool since it contains placeholders for all of the frames in the
BufferPoolManager. However, at any given moment not all the frames in the replacer are considered to be evictable. The size of
LRUKReplacer is represented by the number of evictable frames. The
LRUKReplacer is initialized to have no frames in it. Then, only when a frame is marked as evictable, replacer's size will increase.
You will need to implement the LRU-K policy discussed in the class. You will need to implement the following methods:
Evict(frame_id_t*): Evict the frame with largest backward k-distance compared to all other evictable frames being tracked by the
Replacer. Store the frame id in the output parameter and return
True. If there are no evictable frames return
RecordAccess(frame_id_t): Record that given frame id is accessed at current timestamp. This method should be called after a page is pinned in the
Remove(frame_id_t): Clear all access history associated with a frame. This method should be called only when a page is deleted in the
SetEvictable(frame_id_t, bool set_evictable): This method controls whether a frame is evictable or not. It also controls
LRUKReplacer's size. You'll know when to call this function when you implement the
BufferPoolManager. To be specific, when pin count of a page reaches 0, its corresponding frame is marked evictable and replacer's size is incremented.
Size(): This method returns the number of evictable frames that are currently in the
The implementation details are up to you. You are allowed to use built-in STL containers. You can assume that you will not run out of memory, but you must make sure that the operations are thread-safe.
Task #3 - Buffer Pool Manager Instance
Lastly, you need to implement the buffer pool manager in your system (
BufferPoolManagerInstance is responsible for fetching database pages from the
DiskManager and storing them in memory. The
BufferPoolManagerInstance can also write dirty pages out to disk when it is either explicitly instructed to do so or when it needs to evict a page to make space for a new page.
To make sure that your implementation works correctly with the rest of the system, we will provide you with some of the functions already filled in. You will also not need to implement the code that actually reads and writes data to disk (this is called the
DiskManager in our implementation). We will provide that functionality for you.
All in-memory pages in the system are represented by
Page objects. The
BufferPoolManagerInstance does not need to understand the contents of these pages. But it is important for you as the system developer to understand that
Page objects are just containers for memory in the buffer pool and thus are not specific to a unique page. That is, each
Page object contains a block of memory that the
DiskManager will use as a location to copy the contents of a physical page that it reads from disk. The
BufferPoolManagerInstance will reuse the same
Page object to store data as it moves back and forth to disk. This means that the same
Page object may contain a different physical page throughout the life of the system. The
Page object's identifer (
page_id) keeps track of what physical page it contains; if a
Page object does not contain a physical page, then its
page_id must be set to
Page object also maintains a counter for the number of threads that have "pinned" that page. Your
BufferPoolManagerInstance is not allowed to free a
Page that is pinned. Each
Page object also keeps track of whether it is dirty or not. It is your job to record whether a page was modified before it is unpinned. Your
BufferPoolManagerInstance must write the contents of a dirty
Page back to disk before that object can be reused.
BufferPoolManagerInstance implementation will use the
LRUKReplacer class that you created in the previous steps of this assignment. It will use
ExtendibleHashTable for the table that maps
frame_id. It will also use the
LRUKReplacer to keep track of when
Page objects are accessed so that it can decide which one to evict when it must free a frame to make room for copying a new physical page from disk.
You will need to implement the following functions defined in the header file (src/include/buffer/buffer_pool_manager_instance.h) in the source file (src/buffer/buffer_pool_manager_instance.cpp):
FetchPgImp, you should return nullptr if no page is available in the free list and all other pages are currently pinned.
FlushPgImp should flush a page regardless of its pin status.
UnpinPgImp, the is_dirty parameter keeps track of whether a page was modified while it was pinned.
AllocatePage private method provides the
BufferPoolManager a unique new page id when you want to create a new page in
NewPgImp(). On the other hand, the
DeallocatePage() method is a no-op that imitates freeing a page on the disk and you should call this in your
Finally, refer to the function documentation for details on how to implement these functions and how
BufferPoolManagerInstance interacts with
LRUKReplacer. Don't touch the non-impl versions, we need those to grade your code.
The Disk Manager class (src/include/storage/disk/disk_manager.h) reads and writes the page data from and to the disk. Your buffer pool manager will use
DiskManager::WritePage() whenever it needs to fetch a page to the buffer pool or flush a page to the disk.
Please refer to the header files (extendible_hash_table.h, lru_k_replacer.h, buffer_pool_manager_instance.h) for more detailed specs and documentations.
See the Project #0 instructions on how to create your private repository and setup your development environment.
You can test the individual components of this assigment using our testing framework. We use GTest for unit test cases. There are three separate files that contain tests for each component:
You can compile and run each test individually from the command-line:
$ mkdir build $ cd build $ make extendible_hash_table_test -j$(nproc) $ ./test/extendible_hash_table_test
You can also run
make check-tests to run ALL of the test cases. Note that some tests are disabled as you have not implemented future projects. You can disable tests in GTest by adding a
DISABLED_ prefix to the test name.
Important: These tests are only a subset of the all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.
Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of your source code. Your project grade will be zero if your submission fails any of these checks.
Execute the following commands to check your syntax. The
format target will automatically correct your code. The
check-clang-tidy-p1 targets will print errors and instruct you how to fix it to conform to our style guide.
$ make format $ make check-lint $ make check-clang-tidy-p1
Instead of using
printf statements for debugging, use the
LOG_* macros for logging information like this:
LOG_INFO("# Pages: %d", num_pages); LOG_DEBUG("Fetching page %d", page_id);
To enable logging in your project, you will need to reconfigure it like this:
$ mkdir build $ cd build $ cmake -DCMAKE_BUILD_TYPE=DEBUG .. $ make
The different logging levels are defined in src/include/common/logger.h. After enabling logging, the logging level defaults to
LOG_LEVEL_INFO. Any logging method with a level that is equal to or higher than
LOG_ERROR) will emit logging information. Note that you will need to add
#include "common/logger.h" to any file that you want to use the logging infrastructure.
We encourage you to use
gdb to debug your project if you are having problems.
Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.
If you are having compilation problems, running
make clean does not completely reset the compilation process. You will need to delete your build directory and run
cmake .. again before you rerun
Each project submission will be graded based on the following criteria:
- Does the submission successfully execute all of the test cases and produce the correct answer?
- Does the submission execute without any memory leaks?
- Does the submission follow the code formatting and style policies?
Note that we will use additional test cases to grade your submission that are more complex than the sample test cases that we provide you.
See the late policy in the syllabus.
After completing the assignment, you can submit your implementation to Gradescope:
You only need to include the following files:
make submit-p1 in your
build/ directory will generate a
zip archive called
project1-submission.zip under your project root directory that you can submit to Gradescope. This is equivelant to running the following
zip command in the project root directory:
$ zip project1-submission.zip \ src/include/container/hash/extendible_hash_table.h \ src/container/hash/extendible_hash_table.cpp \ src/include/buffer/lru_k_replacer.h \ src/buffer/lru_k_replacer.cpp \ src/include/buffer/buffer_pool_manager_instance.h \ src/buffer/buffer_pool_manager_instance.cpp
You can submit your answers as many times as you like and get immediate feedback.
Notes on Gradescope and Autograder
- If you are timing out on Gradescope, it's likely because you have a deadlock in your code or your code is too slow and does not run in 60 seconds. If your code is too slow it may be because your
LRUKReplaceris not efficient enough.
- The autograder will not work if you do not remove logs before submissions.
- If the autograder did not work properly, make sure that your formatting commands work and that you are submitting the right files.
- The leaderboard benchmark score will be calculated by stress testing your
CMU students should use the Gradescope course code announced on Piazza.
- Every student has to work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.