Project #3 - Query Execution

Do not post your project on a public Github repository.

Overview

The third programming project is to add support for executing queries in your database system. You will implement executors that are responsible for taking query plan nodes and executing them. You will create executors that perform sequential scans, inserts, hash joins, and aggregations. Because the DBMS does not support SQL (yet), your implementation will operate directly on hand-written query plans.

We will use the Iterator query processing model (i.e., the Volcano model). Every query plan executor implements a Next function. When the DBMS invokes an executor's Next function, the executor either returns a single tuple or indicates that there are no more tuples. This lets the executor implement a loop that just keeps calling Next on its children to get and process their tuples.

As optional reading, you may be interested in following a select statement through the PostgreSQL internals.

The project is comprised of the following tasks:

This is a single-person project that will be completed individually (i.e., no groups).

Please post all of your questions about this project on Piazza. Do not email the TAs directly with questions. The instructor and TAs will not debug your code. You may verbally describe test cases on Piazza, but do not post an excessive amount of testing code on Piazza either.

  • Release Date: Oct 21, 2019
  • Due Date: Nov 17, 2019 @ 11:59pm

Project Specification

Like the previous projects, we are providing you with stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you do this, then it will break the test code that we will use to grade your assignment and you will end up getting no credit for the project. If a class already contains certain member variables, you should not remove them. You may however add private helper functions and member variables to these classes.

The correctness of this project depends on the correctness of your implementation of previous projects, we will not provide solutions or binary files.

Advice from the TAs

  • Task #1 is much easier than the others. Task #3 is open-ended. Budget your time wisely.
  • Tasks #1 and #2 do not rely on a working linear probe hash table implementation. Focus on getting those first.
  • Conceptual understanding is important for this project. It will make your life much easier.

Task #1 - Creating a Catalog Table

A database maintains an internal catalog to keep track of meta-data about the database. For example, the catalog is used to answer what tables are present and where. For more details, refer back to Lecture #04 - Database Storage (Part II).

In this warm-up task, you will be modifying src/include/catalog/simple_catalog.h to that allow the DBMS to add new tables to the database and retrieve them using either the name or internal object identifier (table_oid_t). You will implement the CreateTable, GetTable(const std::string &table_name), and GetTable(table_oid_t table_oid) methods. You should find these tasks straightforward, the goal is to familiarize yourself with the catalog and how it interacts with the rest of the system.

Testing

There is a sample test in test/catalog/catalog_test.cpp.

Task #2 - Executors

In the second task, you will implement executors for sequential scans, inserts, hash joins, and aggregations. For each query plan operator type, there is a corresponding executor object that implements the Init and Next methods. The Init method is for setting up internal state about the invocation of the operator (e.g., retrieving the corresponding table to scan). The Next method provides the iterator interface that returns a single tuple on each invocation (or null if there are no more tuples).

The executors that you will implement are defined in the following header files:

  • src/include/execution/executors/seq_scan_executor.h
  • src/include/execution/executors/insert_executor.h
  • src/include/execution/executors/hash_join_executor.h
  • src/include/execution/executors/aggregation_executor.h

We assume executors are single-threaded throughout the entire project. You are also free to add private helper functions and class members as you see fit.

To understand how the executors are created at runtime during query execution, refer to the ExecutorFactory (src/include/execution/executor_factory.h) helper class. Moreover, every executor has an ExecutorContext (src/include/execution/executor_context.h) that maintains additional state about the query. Finally, we have provided sample tests for you in test/execution/executor_test.cpp.

Plan nodes are the input format or blueprint for the executors that you will be implementing. Plan nodes and executors are both tree-like; they accept tuples from their children and give tuples to their parent. They may rely on conventions about the ordering of their children, which we will describe below.

Sequential Scans

Sequential scans iterate over a table and return its tuples one-at-a-time. A sequential scan is specified by a SeqScanPlanNode. The plan node specifies which table to iterate over. The plan node may also contain a predicate; if a tuple does not satisfy the predicate, it is skipped over.

Hint: Be careful when using the TableIterator object. Make sure that you understand the difference between the pre-increment and post-increment operators. You may find yourself getting strange output by switching between ++iter and iter++.

Hint: You will want to make use of the predicate in the sequential scan plan node. In particular, take a look at AbstractExpression::Evaluate. Note that this returns a Value, which you can GetAs<bool>.

Insert

Inserts add tuples to tables. Inserts are specified by an InsertPlanNode. There are two types of inserts: raw inserts, where the values to be inserted are embedded directly inside the plan node itself, or not-raw inserts, which take the values to be inserted from a child executor. For example, you could have an InsertPlanNode whose child was a SeqScanPlanNode to copy one table into another.

Hash Join

Hash joins are used to combine the results of two child executors together. In this project, you just need to implement a basic hash join. For more details on hash joins, refer to the Lecture #11 notes on join algorithms. In this project, by convention the left child is used to build the hash table and the right child is used to probe.

We are providing you with a SimpleHashJoinHashTable implementation. We strongly recommend that you use this hash table and obtain full credit for this task before proceeding to Task #3.

We have also provided you with a HashValues function, which may be useful.

Hint: You will want to make use of the predicate in the hash join plan node. In particular, take a look at AbstractExpression::EvaluateJoin, which handles the left tuple and right tuple and their respective schemas. Note that this returns a Value, which you can GetAs<bool>.

Aggregation

Aggregations are used to combine multiple tuple results from a single child executor into a single tuple. In this project, we ask you to implement COUNT, SUM, MIN, and MAX.

Note that we provide you with a SimpleAggregationHashTable. We strongly recommend that you use this hash table. We have trimmed down the course project and you do not need to do this, but you may be interested in seeing what it takes to use your LinearProbeHashTable from Project #2 instead.

Hint: You will want to make use of the having expression in the aggregate plan node. In particular, take a look at AbstractExpression::EvaluateAggregate, which handles the groupbys of the key and the aggregates of the value. Note that this returns a Value, which you can GetAs<bool>.

Testing

We have provided sample tests in test/execution/executor_test.cpp.

Task #3 - Linear Probe Hash Table Returns

Intermission: We strongly recommend submitting to Gradescope at this point. This provides a backup of your submission, and you want to make sure that you get full credit for Tasks #1 and #2 before proceeding further.

This task will require you to modify your hash join executor. You will now use your Linear Probe Hash Table from Project #2 instead of the simplified hash join hash table, henceforth referred to as JHT.

Note: Task #3.1 can be skipped entirely. As long as you receive full credit for Task #3.2, you will also automatically receive full credit for Task #3.1.

It is likely that you will need to understand explicit template instantiation. You have already seen code that used explicit template instantiation in Project #2.

Task #3.1 - Linear Probe Hash Table, TmpTuplePage

This part of the project is intended to be open-ended. Again, receiving full credit for Task #3.2 will automatically give you full credit for Task #3.1.

You are provided, but are not required to use, the following stub classes:

  • TmpTuplePage at src/include/storage/page/tmp_tuple_page.h. In our implementation, we wanted a simplified version of a TablePage. Although this is not necessary, we found that this will make the development easier.
  • TmpTuple at src/include/storage/table/tmp_tuple.h. In our implementation, this solved certain issues that come from instantiating LinearProbeHashTable. You should think carefully about what those issues are.

To provide more scaffolding to students who prefer a more structured project, implementing missing functionality in TmpTuplePage will earn you full credit for Task #3.1. However, this may constrain you to working towards our solution. You are also free to implement the missing TmpTuplePage functionality and not use any of it.

Task #3.2 - Linear Probe Hash Table, Hash Join

Lastly, you will modify your hash join executor in src/include/execution/executors/hash_join_executor.h to use your hash table from Project #2. You may find TmpTuplePage useful, but there are other solutions. As long as the type of your hash table used in the hash join is a LinearProbeHashTable, you will obtain full credit.

Note: Depending on how you choose to use the LinearProbeHashTable, you may need to remove the uniqueness constraint from it. Feel free to do so. In our solution involving TmpTuplePage and TmpTuple, we did not need to.

You may also add functions to src/include/execution/executor_context.h if you find that helpful.

Grading Rubric

Each project submission will be graded based on the following criteria:

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks?
  3. Does the submission follow the code formatting and style policies?

Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.

Late Policy

See the late policy in the syllabus.

Submission

After completing the assignment, you can submit your implementation to Gradescope.

You need to include all the files from the previous projects.

Minimally, we expect the following files to be submitted:

  • src/include/catalog/simple_catalog.h
  • src/include/execution/executors/aggregation_executor.h
  • src/include/execution/executors/hash_join_executor.h
  • src/include/execution/executors/insert_executor.h
  • src/include/execution/executors/seq_scan_executor.h
  • src/include/execution/executor_context.h

In addition to files from Project #1 and Project #2:

  • src/include/storage/page/hash_table_header_page.h
  • src/storage/page/hash_table_header_page.cpp
  • src/include/storage/page/hash_table_block_page.h
  • src/storage/page/hash_table_block_page.cpp
  • src/include/container/hash/linear_probe_hash_table.h
  • src/container/hash/linear_probe_hash_table.cpp
  • src/include/buffer/clock_replacer.h
  • src/buffer/clock_replacer.cpp
  • src/include/buffer/buffer_pool_manager.h
  • src/buffer/buffer_pool_manager.cpp

You may add additional files as long as they are also under the src/ directory.

You can submit your answers as many times as you like and get immediate feedback.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.