Project #3 - Query Execution

Do not post your project on a public GitHub repository.

Overview

At this point in the semester, you have implemented the internal components of a database management system. In Project #1, you implemented a buffer pool manager. In Project #2, you implemented a B+tree index. In this project, you will implement the components that allow BusTub to execute queries. You will create the operator executors that execute SQL queries and implement optimizer rules to transform query plans.

This project is composed of the following tasks, in which you will implement each of the individual executors according to the specifications provided below.

Task #1: Access Method Executors
Task #2: Aggregation and Join Executors
Task #3: Sort + Limit Executors and Top-N Optimization
Leaderboard Task (Optional)

This is a single-person project that will be completed individually (i.e., no groups).

Remember to pull the latest code from the bustub repository before starting this project.

You can do this with the following command if you followed the setup instructions in BusTub:

git pull public master

Release Date: Oct 25, 2022
Due Date: Nov 16, 2022 @ 11:59pm

Background

We begin by discussing the basics of query processing. Please read this section carefully, as you will need to construct SQL queries yourself in this project to test your executor implementations.

The image below is an overview of BusTub's architecture:

In the public BusTub repository, we already provide you with a full query processing layer. You can use the BusTub shell to execute SQL queries, as you did in Homework 1. Use the following commands to compile and run the BusTub shell:

cd build && make -j$(nproc) shell
./bin/bustub-shell

You can also use BusTub Web Shell to run the examples below. It is a complete reference solution of the system running in your browser!

Within the shell, you can use \dt to view all tables. By default, the BusTub shell will automatically create three tables that are pre-populated with data. This is provided as a convenience for you so that you do not have to load fake data every time you rebuild your code. Any changes to these tables will not be persisted when you restart the DBMS.

bustub> \dt
+-----+----------------+------------------------------+
| oid | name           | cols                         |
+-----+----------------+------------------------------+
| 0   | __mock_table_1 | (colA:INTEGER, colB:INTEGER) |
| 1   | __mock_table_2 | (colC:VARCHAR, colD:VARCHAR) |
| 2   | __mock_table_3 | (colE:INTEGER, colF:VARCHAR) |
| ... | ...            | ...                          |
+-----+----------------+------------------------------+

You can view all data from a table by using the SELECT statement:

bustub> SELECT * FROM __mock_table_1;
+---------------------+---------------------+
| __mock_table_1.colA | __mock_table_1.colB |
+---------------------+---------------------+
| 0                   | 0                   |
| 1                   | 100                 |
| 2                   | 200                 |
| 3                   | 300                 |
| 4                   | 400                 |
| 5                   | 500                 |
| ...                 | ...                 |
+---------------------+---------------------+

Important things to note:

BusTub only supports a small subset of SQL syntax. Don't be surprised if it does not work with some SQL queries. For all SQL queries supported in BusTub, refer to the SQLLogicTest files in tests/sql.
If you are using CLion to run the BusTub shell, please add a --disable-tty parameter to the shell, so that it works correctly in the CLion terminal.
Always end your statement with ; (except internal commands).
BusTub only supports the INT and VARCHAR(n) types. Also, you should use single quotes for strings, e.g., INSERT INTO table VALUES ('a').

Explain SQL Queries

We next need to discuss how the BusTub shell “knows” that SELECT * FROM __mock_table_1; should read everything from __mock_table_1 when given that input. BusTub also supports the EXPLAIN command for printing a query's execution plan. You can add EXPLAIN in front of any query to see this information:

bustub> EXPLAIN SELECT * FROM __mock_table_1;
=== BINDER ===
BoundSelect {
  table=BoundBaseTableRef { table=__mock_table_1, oid=0 },
  columns=[__mock_table_1.colA, __mock_table_1.colB],
  groupBy=[],
  having=,
  where=,
  limit=,
  offset=,
  order_by=[],
  is_distinct=false,
}
=== PLANNER ===
Projection { exprs=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
=== OPTIMIZER ===
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

The result of EXPLAIN provides an overview of the transformation process within the query processing layer. The statement first goes into the parser and the binder, which produce an AST (abstract syntax tree). This is how BusTub understands what the query wants to do. In this example, the query wants to perform a BoundSelect on __mock_table_1 and retrieve two columns (i.e., colA and colB). Note that the binder automatically expands the * character in the original SQL query into the actual columns in the table.

Next, the binder AST goes into the planner, and the planner will produce a query plan corresponding to the query. The query is planned as a tree of two nodes. The data flows from the leaves of the tree to the root.

After that, the optimizer will optimize the query plan. In this case, it removes the projection plan node because it is redundant.

Let us consider a more complex example:

bustub> EXPLAIN (o) SELECT colA, MAX(colB) FROM
  (SELECT * FROM __mock_table_1, __mock_table_3 WHERE colA = colE) GROUP BY colA;
=== OPTIMIZER ===
Agg { types=[max], aggregates=[#0.1], group_by=[#0.0] }
  NestedLoopJoin { type=Inner, predicate=(#0.0=#1.0) }
    MockScan { table=__mock_table_1 }
    MockScan { table=__mock_table_3 }

In BusTub, the optimizer output is always a tree. For this example query, the tree is as follows:

In this project, you will need to construct SQL queries to test each of your executor's implementations. EXPLAIN is extremely important for you to know if a SQL query is using a specific executor.

If you are running this example in the web shell, you might see HashJoin instead of NestedLoopJoin. You do not need to implement HashJoin this semester.

Sample Executors

In the BusTub public repository, we already provide the implementation of several executors. Let's see when the planner will plan SQL queries as these plan nodes.

Projection

A projection plan node is used to do computation over an input. It will always have exactly one child. Try running the following example queries in the BusTub shell and examine their output:

EXPLAIN SELECT 1 + 2;
EXPLAIN SELECT colA FROM __mock_table_1;
EXPLAIN SELECT colA + colB AS a, 1 + 2 AS b FROM __mock_table_1;

A projection plan node contains several expressions for computation. An expression can be a ColumnValueExpression (displayed as #0.0 when explained), which copies a column of the child executor directly into the output column, or a ConstantExpression, which is a constant value (e.g., 1). Expressions are also represented as a tree. For example, 1 + 2 is an ArithmeticExpression with two ConstantExpression nodes (1 and 2) as children.

Please note that the syntax #0.0 means the first column of the first child. You will see something like #0.0 = #1.0 when planning joins.
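To make the expression tree and the #child.column notation concrete, here is a minimal C++17 sketch. All of the types below (Row, Expr, ColumnRef, Constant, Add) are hypothetical stand-ins written for this illustration; they are not BusTub's expression classes and their interfaces are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT BusTub's classes.
using Row = std::vector<int>;

struct Expr {
  virtual ~Expr() = default;
  // children[i] is the current row produced by the i-th child executor.
  virtual int Evaluate(const std::vector<Row> &children) const = 0;
};

// Models "#child_idx.col_idx", e.g. #0.1 is column 1 of child 0.
struct ColumnRef : Expr {
  std::size_t child_idx;
  std::size_t col_idx;
  ColumnRef(std::size_t child, std::size_t col) : child_idx(child), col_idx(col) {}
  int Evaluate(const std::vector<Row> &children) const override {
    assert(child_idx < children.size() && col_idx < children[child_idx].size());
    return children[child_idx][col_idx];
  }
};

struct Constant : Expr {
  int value;
  explicit Constant(int v) : value(v) {}
  int Evaluate(const std::vector<Row> &) const override { return value; }
};

// Interior node: adds the results of its two child expressions.
struct Add : Expr {
  std::unique_ptr<Expr> lhs;
  std::unique_ptr<Expr> rhs;
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) : lhs(std::move(l)), rhs(std::move(r)) {}
  int Evaluate(const std::vector<Row> &children) const override {
    return lhs->Evaluate(children) + rhs->Evaluate(children);
  }
};

// Evaluates the two expressions of
//   SELECT colA + colB AS a, 1 + 2 AS b
// against one input row coming from the single child.
Row ProjectExample(const Row &input) {
  Add a(std::make_unique<ColumnRef>(0, 0), std::make_unique<ColumnRef>(0, 1));  // #0.0 + #0.1
  Add b(std::make_unique<Constant>(1), std::make_unique<Constant>(2));          // 1 + 2
  return {a.Evaluate({input}), b.Evaluate({input})};
}
```

For a join, the vector passed to Evaluate would hold two rows, which is how an expression such as #0.0 = #1.0 can refer to both sides.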

Filter

A filter plan node is used to filter the output of a child using a given predicate. For example,

EXPLAIN SELECT * FROM __mock_table_1 WHERE colA > 1;

A filter plan node has exactly one child and contains a predicate.

Values

A values plan node is used to directly produce values.

EXPLAIN values (1, 2, 'a'), (3, 4, 'b');
CREATE TABLE table1(v1 INT, v2 INT, v3 VARCHAR(128));
EXPLAIN INSERT INTO table1 VALUES (1, 2, 'a'), (3, 4, 'b');

Values plan nodes are useful when inserting into a table with values supplied by the user.

Schema

As you might have noticed, when explaining, there is a long string of column descriptions after each plan node. It's the expected output schema of this plan node. Consider this example:

Projection { exprs=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

This indicates that the executor representing this plan node will produce two columns. Both of them are of integer types. The output schema is inferred within the planner. For this project, your executor implementations must produce tuples with schema exactly as specified in the plan node, otherwise it will fail tests that check the output.

Project Specification

For this project, you will need to implement additional operator executors in BusTub. We will use the iterator query processing model (i.e., the Volcano model). Recall that in this model, every query plan executor implements a Next function. When the DBMS invokes an executor's Next function, the executor returns either (1) a single tuple or (2) an indicator that there are no more tuples. With this approach, each executor implements a loop that continues calling Next on its children to retrieve tuples and process them one-by-one.

In BusTub's implementation of the iterator model, the Next function for each executor returns a record identifier (RID) in addition to a tuple. A record identifier serves as a unique identifier for the tuple.
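The pull-based control flow can be sketched with a toy scan and a toy filter. The names ToyExecutor, ToyScan, ToyFilter, and CollectRids are hypothetical, and the Next signature is an assumption chosen for this illustration rather than BusTub's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy types, not BusTub's Tuple, RID, or AbstractExecutor.
using Row = std::vector<int>;
using Rid = std::size_t;  // here simply the row's position in the table

struct ToyExecutor {
  virtual ~ToyExecutor() = default;
  virtual void Init() = 0;
  // Returns true and fills *row and *rid, or false when exhausted.
  virtual bool Next(Row *row, Rid *rid) = 0;
};

struct ToyScan : ToyExecutor {
  const std::vector<Row> &table;
  std::size_t cursor = 0;
  explicit ToyScan(const std::vector<Row> &t) : table(t) {}
  void Init() override { cursor = 0; }
  bool Next(Row *row, Rid *rid) override {
    assert(row != nullptr && rid != nullptr);
    if (cursor == table.size()) return false;
    *rid = cursor;
    *row = table[cursor++];
    return true;
  }
};

struct ToyFilter : ToyExecutor {
  std::unique_ptr<ToyExecutor> child;
  std::function<bool(const Row &)> predicate;
  ToyFilter(std::unique_ptr<ToyExecutor> c, std::function<bool(const Row &)> p)
      : child(std::move(c)), predicate(std::move(p)) {}
  void Init() override { child->Init(); }
  bool Next(Row *row, Rid *rid) override {
    // Keep pulling from the child until a row passes the predicate.
    while (child->Next(row, rid)) {
      if (predicate(*row)) return true;
    }
    return false;
  }
};

// Drains an executor tree from the root, the way a driver loop would.
std::vector<Rid> CollectRids(ToyExecutor &root) {
  std::vector<Rid> out;
  Row row;
  Rid rid = 0;
  root.Init();
  while (root.Next(&row, &rid)) out.push_back(rid);
  return out;
}
```

Note that the filter never materializes its input: each call to its Next pulls only as many child rows as it needs to produce one output row.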

The executors are created from an execution plan in executor_factory.cpp.

All the test cases in this project are written in a special file format called SQLLogicTest (derived from SQLite). You can find how to use it at the end of this page.

Task #1 - Access Method Executors

In the background section, we saw that BusTub is already able to retrieve data from mock tables in SELECT queries. This is because these are special tables that do not actually store real tuples. Instead, they are "virtual" tables that use the MockScan executor to always generate the same tuples using a predefined algorithm. This is why you cannot update these tables.

In this task, you will need to implement executors that read from and write to the tables in the storage. You will complete your implementation in the following files:

src/include/execution/seq_scan_executor.h
src/execution/seq_scan_executor.cpp
src/include/execution/insert_executor.h
src/execution/insert_executor.cpp
src/include/execution/delete_executor.h
src/execution/delete_executor.cpp
src/include/execution/index_scan_executor.h
src/execution/index_scan_executor.cpp

SeqScan

The SeqScanPlanNode can be planned with a SELECT * from table statement.

bustub> CREATE TABLE t1(v1 INT, v2 VARCHAR(100));
Table created with id = 15
bustub> EXPLAIN (o,s) SELECT * FROM t1;
=== OPTIMIZER ===
SeqScan { table=t1 } | (t1.v1:INTEGER, t1.v2:VARCHAR)

The SeqScanExecutor iterates over a table and returns its tuples, one-at-a-time.

Hint: Be careful when using the TableIterator object. Make sure that you understand the difference between the pre-increment and post-increment operators. You may find yourself getting strange output by switching between ++iter and iter++.
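The difference is easiest to see on a standard-library iterator. The sketch below uses std::vector iterators and assumes that TableIterator follows the same conventional pre/post-increment semantics; ReadBoth is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Starting from the same position, reads one value with post-increment
// and one with pre-increment, and returns (via_post, via_pre).
std::pair<int, int> ReadBoth(const std::vector<int> &v) {
  assert(v.size() >= 2);
  auto post = v.cbegin();
  auto pre = v.cbegin();
  int via_post = *post++;  // reads v[0], then the cursor moves to v[1]
  int via_pre = *++pre;    // the cursor moves to v[1] first, then reads v[1]
  return {via_post, via_pre};
}
```

With pre-increment in the wrong place, a scan silently skips the first tuple; with post-increment, the expression also creates a copy of the iterator, which may matter if the iterator is expensive to copy.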

Hint: The output of sequential scan is a copy of each matched tuple and its original record identifier (RID).

Note: BusTub does not support DROP TABLE or DROP INDEX for now. You can reset your database by simply restarting the shell.

Insert

The InsertPlanNode can be planned with an INSERT statement. Note that you will need to use single quotes to specify a VARCHAR value.

bustub> EXPLAIN (o,s) INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
=== OPTIMIZER ===
Insert { table_oid=15 } | (__bustub_internal.insert_rows:INTEGER)
  Values { rows=2 } | (__values#0.0:INTEGER, __values#0.1:VARCHAR)

The InsertExecutor inserts tuples into a table and updates indexes. It has exactly one child producing values to be inserted into the table. The planner will ensure values have the same schema as the table. The executor will produce a single tuple of an integer number as the output, indicating how many rows have been inserted into the table, after all rows are inserted. Remember to update the index when inserting into the table, if it has an index associated with it.

Hint: You will need to look up table information for the target of the insert during executor initialization. See the System Catalog section below for additional information on accessing the catalog.

Hint: You will need to update all indexes for the table into which tuples are inserted. See the Index Updates section below for further details.

Hint: You will need to use the TableHeap class to perform table modifications.
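The "consume everything, then emit one count" behavior can be sketched as follows. ToyTable and ToyInsert are hypothetical types for this illustration only; BusTub's TableHeap and index interfaces are different. The sketch also emits a count of 0 for an empty input, which is one reading of "a single tuple" and should be checked against the test cases.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy storage; BusTub's TableHeap and index classes differ.
using Row = std::vector<int>;
struct ToyTable {
  std::vector<Row> heap;                     // position in heap acts as the RID
  std::map<int, std::size_t> index_on_col0;  // key -> RID
};

class ToyInsert {
 public:
  ToyInsert(ToyTable *table, std::vector<Row> child_rows)
      : table_(table), child_rows_(std::move(child_rows)) {}

  // First call: insert every child row and report the count.
  // Later calls: report that there is no more output.
  bool Next(int *inserted) {
    if (done_) return false;
    done_ = true;
    *inserted = 0;
    for (const Row &row : child_rows_) {
      assert(!row.empty());
      table_->heap.push_back(row);
      table_->index_on_col0[row[0]] = table_->heap.size() - 1;  // keep the index in sync
      ++*inserted;
    }
    return true;
  }

 private:
  ToyTable *table_;
  std::vector<Row> child_rows_;
  bool done_ = false;
};
```

A delete executor can follow the same shape: drain the child, remove each row and its index entries, and emit the count exactly once.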

Delete

The DeletePlanNode can be planned with a DELETE statement. It has exactly one child with the records to be deleted from the table. Your delete executor should produce an integer output that represents the number of rows that it deleted from the table. It will also need to update the index.

bustub> EXPLAIN (o,s) DELETE FROM t1;
=== OPTIMIZER ===
Delete { table_oid=15 } | (__bustub_internal.delete_rows:INTEGER)
  Filter { predicate=true } | (t1.v1:INTEGER, t1.v2:VARCHAR)
    SeqScan { table=t1 } | (t1.v1:INTEGER, t1.v2:VARCHAR)

bustub> EXPLAIN (o,s) DELETE FROM t1 where v1 = 1;
=== OPTIMIZER ===
Delete { table_oid=15 } | (__bustub_internal.delete_rows:INTEGER)
  Filter { predicate=#0.0=1 } | (t1.v1:INTEGER, t1.v2:VARCHAR)
    SeqScan { table=t1 } | (t1.v1:INTEGER, t1.v2:VARCHAR)

You may assume that the DeleteExecutor is always at the root of the query plan in which it appears. The DeleteExecutor should not modify its result set.

Hint: You only need to get a RID from the child executor and call TableHeap::MarkDelete() to effectively delete the tuple. All deletes will be applied upon transaction commit.

Hint: You will need to update all indexes for the table from which tuples are deleted. See the Index Updates section below for further details.

IndexScan

The IndexScanExecutor iterates over an index to retrieve RIDs for tuples. The operator then uses these RIDs to retrieve their tuples in the corresponding table. It then emits these tuples one-at-a-time.

You can test your index scan executor with a query of the form SELECT * FROM <table> ORDER BY <index column>. We will explain why ORDER BY can be transformed into IndexScan in Task #3.

bustub> CREATE TABLE t2(v3 int, v4 int);
Table created with id = 16

bustub> CREATE INDEX t2v3 ON t2(v3);
Index created with id = 0

bustub> EXPLAIN (o,s) SELECT * FROM t2 ORDER BY v3;
=== OPTIMIZER ===
IndexScan { index_oid=0 } | (t2.v3:INTEGER, t2.v4:INTEGER)

The type of the index object in the plan will always be BPlusTreeIndexForOneIntegerColumn in this project. You can safely cast it and store it in the executor object:

tree_ = dynamic_cast<BPlusTreeIndexForOneIntegerColumn *>(index_info_->index_.get())

You can then construct an index iterator from the index object, scan through all the keys and tuple IDs, look up each tuple in the table heap, and emit all tuples in the index key's order as the output of the executor. BusTub only supports indexes with a single, unique integer column. There will not be duplicate keys in the test cases.
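As a rough picture of that loop, the sketch below uses std::map as a stand-in for the B+tree (both iterate in key order) and a vector position as the RID. IndexOrderScan and its parameters are hypothetical; the real index iterator and table heap calls are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Row = std::vector<int>;

// Walks a toy ordered index and fetches each row from the heap by RID,
// so rows come out in key order regardless of their physical order.
std::vector<Row> IndexOrderScan(const std::map<int, std::size_t> &index,
                                const std::vector<Row> &heap) {
  std::vector<Row> out;
  for (const auto &entry : index) {
    const std::size_t rid = entry.second;
    assert(rid < heap.size());
    out.push_back(heap[rid]);
  }
  return out;
}
```

An actual executor would keep the iterator as member state and return one tuple per Next call instead of building a vector.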

Hint: You have now implemented all of the storage-related executors. In the following tasks, you can create tables and insert values yourself to test your own executor implementations! At this point, you should also have passed SQLLogicTests #1 to #5.

Task #2 - Aggregation & Join Executors

You will complete your implementation in the following files:

src/include/execution/aggregation_executor.h
src/execution/aggregation_executor.cpp
src/include/execution/nested_loop_join_executor.h
src/execution/nested_loop_join_executor.cpp
src/include/execution/nested_index_join_executor.h
src/execution/nested_index_join_executor.cpp

Aggregation

The AggregationPlanNode is used to support the following queries:

EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA;
EXPLAIN SELECT COUNT(colA), min(colB) FROM __mock_table_1;
EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA HAVING MAX(colB) > 10;
EXPLAIN SELECT DISTINCT colA, colB FROM __mock_table_1;

Note that the aggregation executor itself won't need to handle the having predicate. The planner will plan having as a FilterPlanNode. Aggregation executor only needs to do aggregation for each group of input. It has exactly one child.

The schema of aggregation is group-by columns, followed by aggregation columns.

As discussed in Lecture 10, a common strategy for implementing aggregation is to use a hash table. This is the method that you will use in this project. However, we make the simplifying assumption that the aggregation hash table fits entirely in memory. This means that you do not need to worry about implementing the two-stage (Partition, Rehash) strategy for hash aggregations. You can also assume that all of your aggregation results can reside in an in-memory hash table (i.e., the hash table does not need to be backed by buffer pool pages).

We provide you with the SimpleAggregationHashTable data structure that exposes an in-memory hash table (std::unordered_map) but with an interface designed for computing aggregations. This class also exposes the SimpleAggregationHashTable::Iterator type that can be used to iterate through the hash table. You will need to fill out the CombineAggregateValues function for this class.

Hint: Recall that, in the context of a query plan, aggregations are pipeline breakers. This may influence the way that you use the AggregationExecutor::Init() and AggregationExecutor::Next() functions in your implementation. In particular, think about whether the Build phase of the aggregation should be performed in AggregationExecutor::Init() or AggregationExecutor::Next().

Hint: You must consider how to handle NULLs in the input of the aggregation functions (i.e., a tuple may have a NULL value for the attribute used in the aggregation function). See test cases for expected behavior. Group-by columns will never be NULL.

Hint: When performing aggregation on an empty table, CountStarAggregate should return zero and all other aggregate types should return integer_null. This is why GenerateInitialAggregateValue initializes most aggregate values as NULL.
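The sketch below illustrates the combine step and the initial values described in these hints, using std::optional<int> to model a nullable integer (C++17). AggState, Combine, and BuildAggTable are hypothetical names, not the SimpleAggregationHashTable interface. Skipping NULL inputs for everything except COUNT(*) is the usual SQL convention and an assumption here; the test cases remain the authority on expected behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

using Nullable = std::optional<int>;  // std::nullopt models SQL NULL

struct AggState {
  int count_star = 0;  // COUNT(*) starts at zero
  Nullable count;      // COUNT(col), SUM, MIN, and MAX all start as NULL
  Nullable sum;
  Nullable min;
  Nullable max;
};

// Folds one input value into the running state for its group.
void Combine(AggState *s, const Nullable &input) {
  assert(s != nullptr);
  ++s->count_star;                 // COUNT(*) counts every row, even NULLs
  if (!input.has_value()) return;  // the other aggregates skip NULL inputs
  s->count = s->count.value_or(0) + 1;
  s->sum = s->sum.value_or(0) + *input;
  s->min = s->min ? std::min(*s->min, *input) : *input;
  s->max = s->max ? std::max(*s->max, *input) : *input;
}

// Build phase: consume all (group key, value) rows before emitting anything.
std::unordered_map<int, AggState> BuildAggTable(
    const std::vector<std::pair<int, Nullable>> &rows) {
  std::unordered_map<int, AggState> table;
  for (const auto &row : rows) Combine(&table[row.first], row.second);
  return table;
}
```

A default-constructed AggState corresponds to the empty-table case above: the COUNT(*) slot is zero and every other slot is NULL.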

NestedLoopJoin

The DBMS will use NestedLoopJoinPlanNode for all join operations by default. Consider the following example queries:

EXPLAIN SELECT * FROM __mock_table_1, __mock_table_3 WHERE colA = colE;
EXPLAIN SELECT * FROM __mock_table_1 INNER JOIN __mock_table_3 ON colA = colE;
EXPLAIN SELECT * FROM __mock_table_1 LEFT OUTER JOIN __mock_table_3 ON colA = colE;

You will need to implement inner join and left join for NestedLoopJoinExecutor by using the basic nested loop join algorithm mentioned in the lecture. The output schema of this operator is all columns from the left table followed by all the columns from the right table.

This executor should implement the simple nested loop join algorithm presented in Lecture 11. That is, for each tuple in the join's outer table, you should consider each tuple in the join's inner table, and emit an output tuple if the join predicate is satisfied.

Hint: You will want to make use of the predicate in the NestedLoopJoinPlanNode. In particular, take a look at AbstractExpression::EvaluateJoin, which handles the left tuple and right tuple and their respective schemas. Note that this returns a Value, which could be false, true, or NULL. See FilterExecutor on how to apply predicates on tuples.
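The following C++17 sketch shows the simple nested loop join over fully materialized inputs, including the NULL padding that a left join needs for unmatched outer rows. The types and the NestedLoopJoin function are hypothetical; in particular, the predicate here is a plain callable returning bool, whereas the hint above describes a predicate that yields a Value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Value = std::optional<int>;  // std::nullopt models SQL NULL
using Row = std::vector<Value>;

// Simple nested loop join over materialized inputs (toy types only).
// With left_join set, unmatched outer rows are padded with NULLs.
std::vector<Row> NestedLoopJoin(const std::vector<Row> &outer, const std::vector<Row> &inner,
                                const std::function<bool(const Row &, const Row &)> &pred,
                                bool left_join, std::size_t inner_width) {
  std::vector<Row> out;
  for (const Row &l : outer) {
    bool matched = false;
    for (const Row &r : inner) {
      assert(r.size() == inner_width);
      if (!pred(l, r)) continue;
      Row joined = l;  // left columns first, then right columns
      joined.insert(joined.end(), r.begin(), r.end());
      out.push_back(std::move(joined));
      matched = true;
    }
    if (left_join && !matched) {
      Row padded = l;
      padded.resize(l.size() + inner_width);  // new slots default to NULL
      out.push_back(std::move(padded));
    }
  }
  return out;
}
```

Because a real executor returns one tuple per Next call, it has to remember the current outer tuple, the position in the inner input, and whether the outer tuple has matched yet, rather than looping to completion as this sketch does.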

NestedIndexJoin

The DBMS will use NestedIndexJoinPlanNode if the query contains a join with an equi-condition and the right side of the join has an index over the condition.

Consider the following example:

CREATE TABLE t1(v1 int, v2 int);
CREATE TABLE t2(v3 int, v4 int);
CREATE INDEX t2v3 on t2(v3);
EXPLAIN SELECT * FROM t1 INNER JOIN t2 ON v1 = v3;
=== PLANNER ===
Projection { exprs=[#0.0, #0.1, #0.2, #0.3] } | (t1.v1:INTEGER, t1.v2:INTEGER, t2.v3:INTEGER, t2.v4:INTEGER)
  NestedLoopJoin { predicate=#0.0=#1.0 } | (t1.v1:INTEGER, t1.v2:INTEGER, t2.v3:INTEGER, t2.v4:INTEGER)
    SeqScan { table=t1 } | (t1.v1:INTEGER, t1.v2:INTEGER)
    SeqScan { table=t2 } | (t2.v3:INTEGER, t2.v4:INTEGER)
=== OPTIMIZER ===
NestedIndexJoin { key_predicate=#0.0, index=t2v3, index_table=t2 }
  SeqScan { table=t1 }

In the plan phase, the query is planned as a NestedLoopJoin of two tables. The optimizer identifies that the right side of the join (SeqScan t2) has an index on column v3, and the join condition is an equi-condition v1 = v3. This means that for all tuples from the left side, the system can use the key v1 to query the index t2v3 to produce the join result.

The schema of NestedIndexJoin is all columns from the left table (child, outer) and then from the right table (index, inner). This executor will have only one child that propagates tuples corresponding to the outer table of the join. For each of these tuples, you will need to find the corresponding tuple in the inner table that matches the index key given by utilizing the index in the catalog.

Hint: You will want to fetch the tuple from the outer table, construct the index probe key by using key_predicate, and then look up the RID in the index to retrieve the corresponding tuple for the inner table.
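The probe loop from this hint can be sketched with std::map standing in for the index and a vector position standing in for the RID. IndexProbeJoin and its parameters are hypothetical names for this illustration, and only the inner-join case is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Row = std::vector<int>;

// Toy inner join via index probe: for each outer row, look up its key in
// the inner table's index instead of scanning the whole inner table.
// Assumes unique keys, so each probe finds at most one RID.
std::vector<Row> IndexProbeJoin(const std::vector<Row> &outer, std::size_t outer_key_col,
                                const std::map<int, std::size_t> &inner_index,
                                const std::vector<Row> &inner_heap) {
  std::vector<Row> out;
  for (const Row &l : outer) {
    assert(outer_key_col < l.size());
    auto hit = inner_index.find(l[outer_key_col]);  // probe key taken from the outer row
    if (hit == inner_index.end()) continue;         // no match: an inner join drops the row
    const Row &r = inner_heap[hit->second];
    Row joined = l;  // outer columns first, then inner columns
    joined.insert(joined.end(), r.begin(), r.end());
    out.push_back(joined);
  }
  return out;
}
```

Compared with the nested loop join, the inner loop disappears: each outer row costs one index lookup rather than a full pass over the inner table.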

Note: We will never insert duplicate rows into tables with indexes.

Note: We will provide all test cases on Gradescope AS-IS. You only need to pass the tests. You do not need to handle strange NULL edge cases (e.g., NULLs in group-by columns or in indexes).

Hint: At this point, you should have passed SQLLogicTests - #6 to #12.

Task #3 - Sort + Limit Executors and Top-N Optimization

You will complete your implementation in the following files:

src/include/execution/sort_executor.h
src/execution/sort_executor.cpp
src/include/execution/limit_executor.h
src/execution/limit_executor.cpp
src/include/execution/topn_executor.h
src/execution/topn_executor.cpp
src/optimizer/sort_limit_as_topn.cpp

You need to implement IndexScanExecutor in Task #1 before starting this task. If there is an index over a table, the query processing layer will automatically pick it for sorting. In other cases, you will need a special sort executor to do this.

For all order by clauses, we assume every sort key will only appear once. You do not need to worry about ties in sorting.

Sort

Except in the case that the ORDER BY attributes match the keys of an index, BusTub will use a SortPlanNode for all ORDER BY operators.

EXPLAIN SELECT * FROM __mock_table_1 ORDER BY colA ASC, colB DESC;

This plan node does not change schema (i.e., the output schema is the same as the input schema). You can extract sort keys from order_bys, and then use std::sort with a custom comparator to sort all tuples from the child. You can assume that all entries in a table can fit in memory.

If the query does not include a sort direction in the ORDER BY clause (i.e., ASC, DESC), then the sort mode will be default (which is ASC).
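A comparator for multiple sort keys with mixed directions might look like the sketch below. Direction, SortKey, and SortRows are hypothetical names for this illustration; the plan node's actual order_bys representation is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::vector<int>;
enum class Direction { kDefault, kAsc, kDesc };     // toy enum; kDefault behaves as ASC
using SortKey = std::pair<Direction, std::size_t>;  // (direction, column index)

// Sorts rows by each key in turn; later keys only break ties on earlier ones.
void SortRows(std::vector<Row> *rows, const std::vector<SortKey> &keys) {
  assert(rows != nullptr);
  std::sort(rows->begin(), rows->end(), [&keys](const Row &a, const Row &b) {
    for (const SortKey &key : keys) {
      const std::size_t col = key.second;
      if (a[col] == b[col]) continue;  // tie on this key: consult the next one
      return key.first == Direction::kDesc ? a[col] > b[col] : a[col] < b[col];
    }
    return false;  // equal on every key: neither row orders before the other
  });
}
```

Returning false when all keys are equal keeps the comparator a strict weak ordering, which std::sort requires.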

Limit

The LimitPlanNode specifies the number of tuples that a query will generate. Consider the following example:

EXPLAIN SELECT * FROM __mock_table_1 LIMIT 10;

The LimitExecutor constrains the number of output tuples from its child executor. If the number of tuples produced by its child executor is less than the limit specified in the plan node, this executor has no effect and yields all of the tuples that it receives.

This plan node does not change schema (i.e., the output schema is the same as the input schema). You do not need to support offsets.
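A limit operator only needs a counter. In the sketch below, Source and ToyLimit are hypothetical; the child is modeled as a callable that either produces a value or reports exhaustion.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Toy pull-based source: returns true and fills *out, or false when exhausted.
using Source = std::function<bool(int *out)>;

class ToyLimit {
 public:
  ToyLimit(Source child, std::size_t limit) : child_(std::move(child)), limit_(limit) {}

  bool Next(int *out) {
    assert(out != nullptr);
    // Once the limit is reached, stop without asking the child again.
    if (emitted_ == limit_ || !child_(out)) return false;
    ++emitted_;
    return true;
  }

 private:
  Source child_;
  std::size_t limit_;
  std::size_t emitted_ = 0;
};
```

If the executor can be initialized more than once, the counter would also need to be reset at that point; the sketch omits this.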

Top-N Optimization Rule

For this last task, you are going to modify BusTub's optimizer to support converting top-N queries. Consider the following query:

EXPLAIN SELECT * FROM __mock_table_1 ORDER BY colA LIMIT 10;

By default, BusTub will execute this query by (1) sorting all data from the table and (2) taking the first 10 elements. This is obviously inefficient, since the query only needs the 10 smallest values. A smarter way of doing this is to dynamically keep track of the smallest 10 elements seen so far. This is what BusTub's TopNExecutor does.

You will need to modify the optimizer to support converting a query with ORDER BY + LIMIT clauses to use the TopNExecutor. See OptimizeSortLimitAsTopN for more information.

An example of the optimized plan of this query:

 TopN { n=10, order_bys=[(Default, #0.0)]} | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
   MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

Note: Think of what data structure can be used to track the top n elements (Andy mentioned it in the lecture). The structure should hold at most n elements (where n is the number specified in the LIMIT clause).
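One common choice for a bounded "smallest n so far" structure is a max-heap; whether this is the structure the note refers to is an assumption. The sketch below works on plain integers with std::priority_queue, and SmallestN is a hypothetical helper, not part of BusTub.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Keeps the n smallest values of a stream using a max-heap of size <= n.
// The heap's top is the largest value kept so far, i.e. the first to evict.
std::vector<int> SmallestN(const std::vector<int> &input, std::size_t n) {
  std::priority_queue<int> heap;  // std::less by default, so top() is the maximum
  for (int v : input) {
    if (heap.size() < n) {
      heap.push(v);
    } else if (n > 0 && v < heap.top()) {
      heap.pop();  // evict the current largest to make room for v
      heap.push(v);
    }
  }
  std::vector<int> out;
  while (!heap.empty()) {
    out.push_back(heap.top());
    heap.pop();
  }
  std::reverse(out.begin(), out.end());  // the heap pops largest first
  assert(out.size() <= n);
  return out;
}
```

Each input value costs at most O(log n) work and memory stays at n elements, compared with sorting and storing the whole input. For tuples, the heap would be ordered by the same comparator as the sort.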

Note: Though we did not say it explicitly, the BusTub optimizer is a rule-based optimizer. Most optimizer rules construct optimized plans in a bottom-up way.
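To illustrate what a bottom-up rewrite rule looks like, here is a sketch over a toy plan tree. The Plan struct, its string-typed fields, and the SortLimitAsTopN function are all hypothetical and far simpler than BusTub's plan nodes; the sketch shows only the traversal order and the pattern match.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy immutable plan node; BusTub's plan node classes are richer.
struct Plan;
using PlanRef = std::shared_ptr<const Plan>;
struct Plan {
  std::string type;      // "Limit", "Sort", "TopN", "Scan", ...
  std::size_t n = 0;     // row bound for Limit and TopN
  std::string order_by;  // sort key description for Sort and TopN
  std::vector<PlanRef> children;
};

// Rewrites Limit(Sort(x)) into TopN(x), optimizing children first (bottom-up).
PlanRef SortLimitAsTopN(const PlanRef &plan) {
  Plan copy = *plan;
  for (auto &child : copy.children) child = SortLimitAsTopN(child);
  if (copy.type == "Limit" && copy.children.size() == 1 && copy.children[0]->type == "Sort") {
    const Plan &sort = *copy.children[0];
    assert(sort.children.size() == 1);
    // Take n from the Limit, the keys from the Sort, and skip both nodes.
    return std::make_shared<Plan>(Plan{"TopN", copy.n, sort.order_by, sort.children});
  }
  return std::make_shared<Plan>(std::move(copy));
}
```

Because children are rewritten before their parent is examined, the rule fires wherever the Limit-over-Sort pattern occurs in the tree, not only at the root.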

Hint: At this point, your implementation should be able to pass SQLLogicTests #13 to #16. Integration-test-2 requires you to use release mode to run.

Leaderboard Task (Optional)

For this project's leaderboard challenge, we are giving you the SQL queries ahead of time. It is up to you to implement new executors and optimizer rules to make the system execute these queries as fast as possible.

The leaderboard is optional (i.e., you do not need to do this to get a perfect score on the project).

It is possible that your implementation will produce different optimization results for existing queries after implementing the optimizations in the leaderboard test. We require you to pass all tests after implementing new optimization rules. Meanwhile, we will also force using starter rules for some test cases. For example, in order to ensure your index scan executor works, we force the starter rule in this sqllogictest file with set force_optimizer_starter_rule=yes.

Query 1: Where's the Index?

Consider the following sample database:

CREATE TABLE t1(x INT, y INT);
CREATE TABLE t2(x INT, y INT);
CREATE TABLE t3(x INT, y INT);
CREATE INDEX t1x ON t1(x);

Now a user comes along and executes the following query. Note that this query is not the same as the leaderboard query; please refer to the test file.

SELECT * FROM (t1 INNER JOIN t2 ON t1.x = t2.x) INNER JOIN t3 ON t2.y = t3.y;

Even though there is an index on t1.x, BusTub does not pick it for the join! What a 💩 database system! Furthermore, there are two nested loop joins, which is extremely inefficient! Oops!

Recommended Optimizations: Use hash join to handle equi-condition; join reordering to pick the index for t1; join t2 and t3 first based on the cardinality (use EstimatedCardinality function). Note that hash join is NOT required for a full score in this project. We also have an existing rule for converting NLJ into HashJoin and you will need to manually enable it. See optimizer_custom_rules.cpp for more information.

Query 2: Too Many Joins!

Consider the following sample database:

CREATE TABLE t4(x int, y int);
CREATE TABLE t5(x int, y int);
CREATE TABLE t6(x int, y int);

The user is not from CMU and they are writing terrible SQL. They forgot how to write queries with joins, so they put all predicates in the WHERE clause.

SELECT * FROM t4, t5, t6
  WHERE (t4.x = t5.x) AND (t5.y = t6.y) AND (t4.y >= 1000000)
    AND (t4.y < 1500000) AND (t6.x >= 100000) AND (t6.x < 150000);

(Not the same as the actual leaderboard query, refer to the test file. We've already pushed one filter down in the actual leaderboard query.)

Recommended Optimizations: Decompose filter condition to extract hash join keys, push down filter below hash join to reduce data from the table scan.

Query 3: The Mad Data Scientist

There is a data scientist who invested all their money in NFTs. After realizing their terrible mistake, they go crazy and start writing some weird SQL queries. Consider the following example:

SELECT v, d1, d2 FROM (
  SELECT v,
         MAX(v1) AS d1, MIN(v1), MAX(v2), MIN(v2),
         MAX(v1) + MIN(v1), MAX(v2) + MIN(v2),
         MAX(v1) + MAX(v1) + MAX(v2) AS d2
    FROM t7 LEFT JOIN (SELECT v4 FROM t8 WHERE 1 == 2) ON v < v4
    GROUP BY v
)

(Not the same as the actual leaderboard query, refer to the test file.)

Recommended Optimizations: Column Pruning – you only need to compute v, d1, d2 from the left table in aggregation, common expression elimination, transform always false filter to dummy scan (values plan node of zero rows).

Hint: You do not need to implement a complete rule for optimizing these queries. (1) A complete predicate pushdown requires you to handle all plan nodes (limit, order by, etc.), but to optimize for Q2, you only need to push predicates down over hash joins and nested loop joins. (2) A complete join reordering requires you to handle predicates correctly (and maybe absorb filters in-between back into the join predicate), and you do not need to do that. Making your optimizer work with those queries is enough.

Additional Information

This section provides some additional information on other system components in BusTub that you will need to interact with in order to complete this project.

System Catalog

A database maintains an internal catalog to keep track of meta-data about the database. In this project, you will interact with the system catalog to query information regarding tables, indexes, and their schemas.

The entirety of the catalog implementation is in src/include/catalog/catalog.h. You should pay particular attention to the member functions Catalog::GetTable() and Catalog::GetIndex(). You will use these functions in the implementation of your executors to query the catalog for tables and indexes.

Index Updates

For the table modification executors (InsertExecutor and DeleteExecutor) you must modify all indexes for the table targeted by the operation. You may find the Catalog::GetTableIndexes() function useful for querying all of the indexes defined for a particular table. Once you have the IndexInfo instance for each of the table's indexes, you can invoke index modification operations on the underlying index structure.

In this project, we use your implementation of B+ Tree Index from Project 2 as the underlying data structure for all index operations. Therefore, successful completion of this project relies on a working implementation of the B+ Tree Index.

Instructions

See the Project #0 instructions on how to create your private repository and set up your development environment.

You must pull the latest changes from the upstream BusTub repository for test files and other supplementary files we provide in this project.

Testing

We will use SQLLogicTest to perform testing and benchmarking. To use it,

make -j$(nproc) sqllogictest
./bin/bustub-sqllogictest ../test/sql/p3.00-primer.slt --verbose

You can use the bustub-sqllogictest program to run slt files. Remember to recompile sqllogictest before doing any testing. In this project, we provide ALL test cases to you. There are no hidden tests. The test cases are located at test/sql/.

Development Hints

Instead of using printf statements for debugging, use the LOG_* macros for logging information like this:

LOG_DEBUG("Fetching page %d", page_id);

To enable logging in your project, you will need to configure it in Debug mode:

cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j$(nproc)

The different logging levels are defined in src/include/common/logger.h. After enabling logging, the logging level defaults to LOG_LEVEL_INFO. Any logging method with a level that is equal to or higher than LOG_LEVEL_INFO (e.g., LOG_INFO, LOG_WARN, LOG_ERROR) will emit logging information.

We also recommend using assertions to check preconditions, postconditions, and invariants in your implementation. The macros header defines the BUSTUB_ASSERT and UNREACHABLE macros that may be helpful in this capacity.

Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.

Grading Rubric

Each project submission will be graded based on the following criteria:

Does the submission successfully execute all of the test cases and produce the correct answer?
Does the submission execute without any memory leaks?

Late Policy

See the late policy in the syllabus.

Submission

After completing the assignment, you can submit your implementation to Gradescope for evaluation.

https://www.gradescope.com/courses/424375/

Running make submit-p3 in your build/ directory will generate a zip archive called project3-submission.zip under your project root directory that you can submit to Gradescope.

Remember to resolve all style issues before submitting:

make format
make check-lint
make check-clang-tidy-p3

Collaboration Policy

The collaboration policy for this assignment is as follows:

Every student has to work individually on this assignment.
Students are allowed to discuss high-level details about the project with others.
Students are not allowed to copy the contents of a white-board after a group meeting with other students.
Students are not allowed to copy the solutions from other students.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.

Last Updated: Nov 01, 2022