Project #2 - Hash Table

Do not post your project on a public Github repository.

Overview

The second programming project is to implement a hash table for the BusTub DBMS that is backed by disk storage. Your hash table is responsible for fast data retrieval without having to search through every record in a database table.

You will need to implement a hash table using the linear probing hashing scheme. It is a giant slot array that spans across multiple pages where each block page holds a hash table block. The table will access pages through your buffer pool from Project #1. The table contains a header page to map blocks to pages. Your hash table needs to support growing in size if it runs out of space (i.e., every slot is full).

You will need to complete the following tasks in your hash table implementation:

This is a single-person project that will be completed individually (i.e., no groups).

  • Release Date: Sep 30, 2019
  • Due Date: Oct 20, 2019 @ 11:59pm

Project Specification

Like the first project, we are providing you with stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you do this, then it will break the test code that we will use to grade your assignment you end up getting no credit for the project. If a class already contains certain member variables, you should not remove them. But you may add private helper functions/member variables to these classes in order to correctly realize the functionality.

The correctness of the linear probe hash table index depends on the correctness of your buffer pool implementation. We will not provide solutions for the previous programming projects.

Task #1 - Page Layouts

Your hash table is meant to be accessed through the DBMS's BufferPoolManager. This means that you cannot allocate memory to store information. Everything must be stored in disk pages so that they can read/written from the DiskManager. If you create a hash table, write its pages to disk, and then restart the DBMS, you should be able to load back the hash table from disk after restarting.

To support reading/writing hash table blocks on top of pages, you will implement two Page classes to store the data of your hash table. This is meant to teach you how to allocate memory from the BufferPoolManager as pages.

Hash Table Header Page

This class holds all of the meta-data for the hash table. It is divided into the fields as shown by the table below:

Variable Name Size Description
page_id_ 4 bytes Self Page Id
size_ 4 bytes Number of Key & Value pairs the hash table can hold
next_ind_ 4 bytes The next index to add a new entry to block_page_ids_
lsn_ 4 bytes Log sequence number (Used in Project 4)
block_page_ids_ 4080 bytes Array of block page_id_t

The block_page_ids_ array maps block ids to page_id_t ids. The ith element in block_page_ids_ is the page_id for the ith block.

You must implement your Hash Table Header Page in the designated files. You are only allowed to modify the header file (src/include/page/hash_table_header_page.h) and its corresponding source file (src/page/hash_table_header_page.cpp).

Hash Table Block Page

The Hash Table Block Page holds three arrays:

  • occupied_ : The ith bit of occupied_ is 1 if the ith index of array_ has ever been occupied.
  • readable_ : The ith bit of readable_ is 1 if the ith index of array_ holds a readable value.
  • array_ : The array that holds the key-value pairs.

The number of slots available in a Hash Table Block Page depends on the types of the keys and values being stored. You only need to support fixed-length keys and values. The size of keys/values will be the same within a single hash table instance, but you cannot assume that they will be the same for all instances (e.g., hash table #1 can have 32-bit keys and hash table #2 can have 64-bit keys).

You must implement your Hash Table Block Page in the designated files. You are only allowed to modify the header file (src/include/page/hash_table_block_page.h) and its corresponding source file (src/page/hash_table_block_page.cpp).

Each Hash Table Header/Block page corresponds to the content (i.e., the byte array data_) of a memory page fetched by buffer pool. Every time you try to read or write a page, you need to first fetch the page from buffer pool using its unique page_id, then reinterpret cast to either a header or a block page, and unpin the page after any writing or reading operations.

Task #2 - Hash Table Implementation

You will implement a hash table that uses the linear probing hashing scheme. It needs to support insertions (Insert), point search (GetValue), and deletions (Remove).

Your hash table must support both unique and non-unique keys. Duplicate values for the same key are not allowed. This means that (key_0, value_0) and (key_0, value_1) can exist in the same hash table, but not (key_0, value_0) and (key_0, value_0). The Insert method only returns false if it tries to insert an existing key-value pair.

Your hash table implementation must hide the details of the key/value type and associated comparator, like this:

template <typename KeyType,
          typename ValueType,
          typename KeyComparator>
class LinearProbeHashTable {
   // ---
};

These classes are already implemented for you:

  • KeyType: The type of each key in the hash table. This will only be GenericKey, the actual size of GenericKey is specified and instantiated with a template argument and depends on the data type of indexed attribute.
  • ValueType: The type of each value in the hash table. This will only be 64-bit RID.
  • KeyComparator: The class used to compare whether two KeyType instances are less/greater-than each other. These will be included in the KeyType implementation files.

Note that, to compare whether two ValueType instances are equal to each other, you can use the == operator.

Task #3 - Table Resizing

The linear probing hashing scheme uses a fized-size table. When the hash table is full, then any insert operation will get stuck in an infinite loop because the system will walk through the entire slot array and not find a free space. If your hash table detects that it is full, then it must resize itself to be twice the current size (i.e., if currently has n slots, then the new size will be 2×n).

Since any write operation could lead to a change of header_page_id in your hash table, it is your responsibility to update header_page_id in the header page (src/include/page/header_page.h) to ensure that the container is durable on disk.

Task #4 - Concurrency Control

Up to this this point you could assume that your hash table only supported single-threaded execution. In this last task, you will modify your implementation so that it supports multiple threads reading/writing the table at the same time.

You will need to have latches on each block so that when one thread is writing to a block other threads are not reading or modifying that index as well. You should also allow multiple readers to be reading the same block at the same time.

You will need to latch the whole hash table when you need to resize. When resize is called, the size that the table was when resize was called is passed in as an argument. This is so that if the table was resized while a thread was waiting for the latch, it can immediately give up the latch and attempt insertion again.

REQUIREMENTS AND HINTS

  • You are not allowed to use a global scope latch to protect your data structure for each operation. In other words, you may not lock the whole container and only unlock the latch when operations are done.
  • We are providing you a ReaderWriterLatch (src/include/common/rwlatch.h) that you can use in your hash table. There are also helper functions in the Page base class to acquire and release latches (src/include/page/page.h).
  • Sone of your hash table functions will be given a Transaction (src/include/concurrency/transaction.h) object. This object provides methods to store the page on which you have acquired latch while traversing through the Hash Table.

Instructions

Creating Your Own Project Repository

Go to the BusTub GitHub page and click the Fork button to create a copy of the repository under your account.

You can then clone your private repository on your local machine in the filesystem, say under the ~/git directory. If you have not used git before, here's some information on source code management using git.

Setting Up Your Development Environment

To build the system from the commandline, execute the following commands:

$ mkdir build
$ cd build
$ cmake ..
$ make

To speed up the build process, you can use multiple threads by passing the -j flag to make. For example, the following command will build the system using four threads:

$ make -j 4

Testing

You can test the individual components of this assignment using our testing framework. We use GTest for unit test cases. There are two separate files that contain tests for each component:

  • Pages: test/container/hash_table_page_test.cpp
  • Hash Table: test/container/hash_table_test.cpp

You can compile and run each test individually from the command-line:

$ mkdir build
$ cd build
$ make hash_table_test
$ ./test/hash_table_test

You can also run make check-tests to run ALL of the test cases. Note that some tests are disabled as you have not implemented future projects. You can disable tests in GTest by adding a DISABLED_ prefix to the test name.

Important: These tests are only a subset of the all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.

Formatting

Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of your source code. Your project grade will be zero if your submission fails any of these checks.

Execute the following commands to check your syntax. The format target will automatically correct your code. The check-lint and check-clang-tidy targets will print errors and instruct you how to fix it to conform to our style guide.

$ make format
$ make check-lint
$ make check-clang-tidy

Development Hints

Instead of using printf statements for debugging, use the LOG_* macros for logging information like this:

LOG_INFO("# Pages: %d", num_pages);
LOG_DEBUG("Fetching page %d", page_id);

To enable logging in your project, you will need to reconfigure it like this:

$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=DEBUG ..
$ make

The different logging levels are defined in src/include/common/logger.h. After enabling logging, the logging level defaults to LOG_LEVEL_INFO. Any logging method with a level that is equal to or higher than LOG_LEVEL_INFO (e.g., LOG_INFO, LOG_WARN, LOG_ERROR) will emit logging information. Note that you will need to add #include "common/logger.h" to any file that you want to use the logging infrastructure.

We encourage you to use gdb to debug your project if you are having problems.

Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.

Grading Rubric

Each project submission will be graded based on the following criteria:

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks?
  3. Does the submission follow the code formatting and style policies?

Note that we will use additional test cases to grade your submission that are more complex than the sample test cases that we provide you.

Late Policy

See the late policy in the syllabus.

Submission

After completing the assignment, you can submit your implementation to Gradescope:

You need to include the following files (all files from this project and Project 1):

  • src/include/storage/page/hash_table_header_page.h
  • src/storage/page/hash_table_header_page.cpp
  • src/include/storage/page/hash_table_block_page.h
  • src/storage/page/hash_table_block_page.cpp
  • src/include/container/hash/linear_probe_hash_table.h
  • src/container/hash/linear_probe_hash_table.cpp
  • src/include/buffer/clock_replacer.h
  • src/buffer/clock_replacer.cpp
  • src/include/buffer/buffer_pool_manager.h
  • src/buffer/buffer_pool_manager.cpp

You can submit your answers as many times as you like and get immediate feedback.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.