The Fullscan Strawman pattern (ramblings)

January 9, 2023

Database indices enable highly optimized execution paths for the retrieval of information. However, they have their own limitations when the data patterns and query patterns become more complex.

Take, for instance, a table containing data for several thousand employees within an organization with the following fields:

Employee ID
First Name
Last Name
Gender
Department
Location
Manager
Profile Picture

When we start with no indices on the table, the database server would read each and every row before being able to give us a result for just about any query. This seems wasteful of computational resources - the data storage I/O operations, the CPU, and the memory used for caching. For a one-off query, this may be perfectly okay but the query execution time becomes unbounded and it is just not scalable.

Creating an index makes data retrieval faster because it takes the data for the columns selected for indexing and builds a (sometimes… more on that later) separate data structure in the form of a tree, a hashtable, or similar structure that it can use to quickly prepare a shortlist of rows that it needs to fetch from the table. If we frequently have a need to get a list of everyone who reports to a specific manager, we can create an index on the manager field. This is a single-field index. Whenever the database needs to fetch the list of employees reporting to a specific manager, it would traverse a tree or look up a hashtable to get row identifiers that it can then use to fetch data from tables. If this is the only query pattern on the table, the table and index can be built together as an index-organized table or clustered-table - this helps avoid performing a fetch from the table as a separate operation.

When querying and sorting on multiple fields, we are able to achieve good performance with compound indices. These indices can be used for multiple query patterns as long as the queries are based on the leading fields of the indices. Also, the order in which fields used for equality, range, and sort matters - the field used for equality filtering should appear before those used for sort, and the fields used for equality filtering should appear before those used for range filtering.

Because it is not possible to create an index for every possible query pattern, we sometimes have to rely on partial-use of indices, index-intersection, or full-table scans. Full-table scans are very demanding on the storage infrastructure because each row has to be read in its entirety even if all the columns are not being used for the query execution. This can be addressed by creating a separate table specifically for the full-table scans - let’s call this the full-scan strawman pattern, for lack of a better name. Within this new table, the size of individual rows is smaller. The gains from this approach are also very dependent on the size of processing fields in comparison to the size of the processing fields. Indices can be created on low-cardinality fields of this table to help improve query performance.

Notes

When a single-field index is created on a field with low cardinality, it may not improve query performance by much. However, every additional index reduces write performance and increases cache usage. If the performance difference is negligible, drop the index.