Tuesday, July 25, 2006


Effective Indexes for Data Warehouses
Roger Deng
Note: This article was originally printed in DB2 Magazine
Creating the right indexes to boost performance is an art. These guidelines will help you master the process.

Indexes are important objects in a relational database. With effective indexes, complex queries run much faster than they would with less optimal indexes. But query performance depends less on how many indexes exist on the tables than on what kind of indexes exist. Figuring out how to create the most effective indexes is a big challenge.

DB2 Universal Database (UDB) includes an Index Advisor, which makes the process easier. In my work with DB2 UDB version 7.2 for Linux, Unix, and Windows, I use it a lot. Sometimes, however, you may want to test whether the Index Advisor's recommendations are the most effective. To test these recommendations, my team and I create an index using a set of guidelines that I'll share with you in this article. Then we run (and time) test queries. If the Index Advisor's recommendation results in better performance, we use it. If not, we use the index we created.

If you need to create indexes on your own (whether to test the Index Advisor's recommendations or for another reason), here are some suggestions for effective index design and examples of indexes you should avoid.

WHAT DO YOU NEED?

Two types of applications use the database: online transaction processing (OLTP) and decision support (including online analytic processing). OLTP queries are usually simple and involve many data change operations (such as INSERT, UPDATE, and DELETE). The indexes in this environment are usually easy to design. In general, you should minimize the number of OLTP indexes so that transactions can execute very quickly (usually in less than a second). Too many indexes cause performance to suffer.

In a data warehouse or decision support environment, however, most queries are complex. This complexity necessitates the presence of many indexes, which are built during bulk loading. Because queries in a data warehouse environment are complex and dynamic, creating effective indexes can be tricky.

In a data warehouse environment, you probably use a tool to capture user queries and then analyze them with DB2 Explain. That analysis gives you some idea of which indexes are needed. Index Advisor can also help identify the indexes. But using the tool is only the first step. You should look at the whole picture of database load, design, and usage and review all the indexes on your tables to make sure they're effective.
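
For example, once the explain tables exist (DB2 ships a DDL script to create them), a captured query can be explained directly in SQL and the resulting plan formatted with the db2exfmt tool. The query below is only a placeholder:

-- Populate the explain tables with the access plan
-- for one captured warehouse query.
EXPLAIN PLAN FOR
SELECT STORE_ID, SUM(SALE_AMT)
FROM SALES
WHERE SALE_DATE BETWEEN '2001-01-01' AND '2001-12-31'
GROUP BY STORE_ID;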

In the data warehouse environment, queries are unpredictable. You can't expect that they'll all run within seconds. Indexes make SELECTs faster, but they also make UPDATEs slower. So, the first rule of thumb on whether to add indexes is to ask yourself a simple question: Are the indexes necessary to ensure that queries will run (and won't run too long)? If queries are completing within a reasonable time, don't create indexes to help them run faster.

GOOD INDEX CANDIDATES

Usually, you'll want to create indexes on the fields in a query's predicate (the WHERE clause). However, not all fields should be indexed: Whether indexing makes sense depends on the field's cardinality. For example, creating an index on the GENDER column isn't a wise choice, because there are only two distinct values for this column. High cardinality columns, on the other hand, are good candidates for indexing; indexes on those columns give DB2's cost-based optimizer selective access paths to work with.
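
For instance, a customer number column, with a distinct value for nearly every row, is the kind of high cardinality column worth indexing (the table and column names below are only illustrative):

-- CUSTOMER_NUMBER has many distinct values, so an index on it
-- gives the optimizer a highly selective access path.
CREATE INDEX IX_CUST_NUM ON CUSTOMER (CUSTOMER_NUMBER);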

If you must build an index on the column GENDER (for example, if there are many queries doing GROUP BY GENDER and other dimensions), a compound index, which includes more than one field, could work. Instead of only indexing GENDER, for example, add CUSTOMER NUMBER to the index. Adding the high cardinality field will balance the index and avoid the high maintenance that a normal index on a low cardinality field would incur. (Most indexes are stored in B-tree data structures; low cardinality causes the index page to split.) The other choice is to use a bitmap index, which I'll discuss later in this article.
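
A minimal sketch of that compound index, assuming the same illustrative CUSTOMER table and a CUSTOMER_NUMBER column:

-- GENDER alone has only two distinct values; pairing it with the
-- high cardinality CUSTOMER_NUMBER keeps the B-tree well balanced
-- while still supporting GROUP BY GENDER queries.
CREATE INDEX IX_CUST_GENDER_NUM ON CUSTOMER (GENDER, CUSTOMER_NUMBER);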

Composite indexes (sometimes called multicolumn indexes) are generally recommended if queries often contain the same columns joined with AND or if you have keys with skewed domain counts or low cardinality. You can safely use up to five columns in a compound index; anything more than that will result in expensive maintenance. The left-most column in a compound index should be the one that occurs most often in the queries. The order of the columns in the WHERE clause once had to match the order in the compound index; however, modern optimizers make that rule obsolete.
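
As a sketch, if most warehouse queries filter on SALE_DATE and only some also restrict STORE_ID or PRODUCT_ID, the date column should lead (the SALES table and column names here are hypothetical):

-- SALE_DATE occurs in the most queries, so it is the left-most
-- column; STORE_ID and PRODUCT_ID refine the search when present.
CREATE INDEX IX_SALES_DT_STORE_PROD
    ON SALES (SALE_DATE, STORE_ID, PRODUCT_ID);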

There is a misconception that DB2 has to use all the indexes in order to get good performance. But that's not always true. Sometimes, the cost-based optimizer will prefer to scan a table instead of using the indexes. For example, if column A is greater than zero 90 percent of the time, the optimizer will choose to scan the table for the following query:

SELECT * FROM TABLE1 WHERE A > 0;

Let's look at another example. Say there are two indexes on a table, FIRST_NM and LAST_NM, and the following query occurs:

SELECT * FROM TABLE1
WHERE FIRST_NM='ROGER' AND
LAST_NM='DENG';

The DBMS may use the index on LAST_NM and then scan the qualifying rows to match FIRST_NM against the value ROGER, instead of joining the two indexes, in order to avoid a third read of a data page. Sometimes, DBAs create more indexes than they actually need because they don't know that the optimizer will take such an approach. If you know that the user will most often specify LAST_NM (or LAST_NM and FIRST_NM together), the best approach would be to build a composite index on LAST_NM and FIRST_NM instead of two separate indexes.
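
In that case, the single composite index might look like this (TABLE1 is the example table used above; the index names are hypothetical):

-- LAST_NM leads because users specify it most often;
-- FIRST_NM refines the search when both names are given.
CREATE INDEX IX_T1_LAST_FIRST ON TABLE1 (LAST_NM, FIRST_NM);

-- The two separate single-column indexes are then dropped.
DROP INDEX IX_T1_FIRST;
DROP INDEX IX_T1_LAST;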

Another form of compound index, called a covering index, includes the fields from the query's SELECT list as well as its predicate. Covering indexes save one disk read, but only when all the SELECT columns are in the index. If the table's row size is big, using a covering index will dramatically increase query performance because the index acts like a small table: the query just needs to read the index to get the result. Take this example:

SELECT LAST_NM, FIRST_NM, GENDER
FROM TABLE1
WHERE LAST_NM LIKE 'M%';

An index that contains all three columns in this SELECT (LAST_NM, FIRST_NM, and GENDER) on a table that has more than 500 columns can really help performance. In general, if the number of rows in the table is large, try to use a covering index to avoid reading the data page. There are limitations, though. The DBMS will never use covering indexes when there are joins or groupings. Covering indexes are most appropriate for use in read-only data warehouse environments. In read/write OLTP environments, the maintenance would be too high.
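
A covering index for the query above, again on the example TABLE1, simply lists all three referenced columns as key columns (the index name is illustrative):

-- The predicate column LAST_NM leads; FIRST_NM and GENDER are added
-- only so the query can be answered from the index alone, without
-- touching the wide data pages.
CREATE INDEX IX_T1_NAME_GENDER ON TABLE1 (LAST_NM, FIRST_NM, GENDER);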

DUPLICATE INDEXES

Before adding a new index to a table, check to see if a similar index already exists. Say you have an index on column 1 and you want to add an index on column 1 plus column 2. You should replace the existing index on column 1 instead of adding a new one; keeping both leaves a redundant index. Redundant indexes, of course, take extra space and slow data updates. Worse, they confuse the optimizer with several equivalent query paths. So, drop any redundant indexes.
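
In other words, the fix is a drop-and-replace rather than an addition; a sketch with hypothetical table and index names:

-- The new two-column index serves every query the old
-- single-column index served, so the old one goes away.
DROP INDEX IX_T_COL1;
CREATE INDEX IX_T_COL1_COL2 ON T (COL1, COL2);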

Consolidate indexes on a table, if you can, to eliminate some overlap. I once saw a DBA create the following compound indexes in DB2 v.7.2:

Index 1 on (C1, C2, C3, C4)
Index 2 on (C1, C2, C3, C5)
Index 3 on (C1, C2, C5)
Index 4 on (C1, C2, C6, C5)

C1, C2 is the primary key on the table. The DBA thought these indexes would make user queries run faster because the columns are all in the indexes. He was trying to use covering indexes, but there's no need to create four indexes in DB2 v.7.2. One index on (C1, C2) that includes (C3, C4, C5, C6) should be enough.
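
A sketch of that consolidation (INCLUDE requires a unique index, which works here because C1, C2 is the primary key; the table name is hypothetical):

-- One unique index on the key columns, with the remaining columns
-- carried as non-key INCLUDE columns, covers the same queries as
-- the four overlapping indexes above.
CREATE UNIQUE INDEX IX_T_C1_C2_COVER
    ON T (C1, C2)
    INCLUDE (C3, C4, C5, C6);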

CLUSTERED INDEXES

Clustering an index means storing data rows according to the index order. Searches, especially range searches, on a clustered index are much faster. You can have only one clustered index per table; usually, the clustering index is the primary key. You don't need to explicitly create indexes on primary key columns because the system creates them implicitly. To qualify as a primary key, the columns must not be nullable and their values must be unique across rows. If you want the primary key to be the clustering index, plan for it when you create the table: if you simply add the primary key later with an ALTER TABLE statement, the index the system creates for it will not be a clustered index.

To qualify for the clustered index, a column should be:

Not too volatile
Short (preferably an integer)
Unique
The column on which a user will most often do a range search
In the order in which the user wants to get results

BITMAP INDEXES

Originally, DBMSs stored indexes in B-trees. Modern DBMSs also offer bitmap indexes. Bitmap indexes save index disk space and are good for getting statistics out of a large number of rows with static data. Because data warehouses usually contain a lot of data and require many indexes, bitmap indexes are frequently used in a data warehouse environment. The B-tree index is effective only when the selectivity is higher than 0.1. Selectivity is defined as (number of distinct values) / (total number of values).
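
For example, in a table of 1,000,000 rows, a GENDER column with two distinct values has a selectivity of 2 / 1,000,000 = 0.000002, far below the 0.1 threshold, while a customer number column with 900,000 distinct values has a selectivity of 0.9.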

If the selectivity is lower than 0.1, a bitmap index will be a good choice. Bitmap indexes are particularly useful for queries that contain [NOT] EXISTS, UNIQUE, or GROUP BY. For those queries, you should use bitmap indexes on low cardinality columns. Conversely, a high cardinality field, such as social security number, would not be a good candidate for a bitmap index.
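
DB2 UDB itself has no CREATE BITMAP INDEX statement (it builds bitmaps dynamically when ANDing ordinary indexes), but in a DBMS that supports explicit bitmap indexes, such as Oracle, the definition is a one-liner; the table and column here are placeholders:

-- A bitmap index on a low cardinality column.
CREATE BITMAP INDEX IX_CUST_GENDER_BM ON CUSTOMER (GENDER);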

THE ART OF INDEXES

For a DBA, creating indexes is like creating a work of art. Good design (that balances system maintenance load and query performance) is essential. The best indexes serve several queries instead of just one. When that occurs, the return on your index investment grows.

Utilities such as Index Advisor are handy, but remember that you're responsible for the final decision. DBAs must understand the logic behind the utility's choices or suggestions. If the logic doesn't seem to make sense, test it. Tools can't handle every situation, especially those that require consolidating indexes for an overall view. Familiarity with the essentials of indexing helps DBAs make smart decisions and create the most effective indexes for the database.

For a clustering index, new rows are inserted physically close to existing rows with similar key values. This yields a performance benefit during queries because it results in a more linear access pattern to data pages and more effective pre-fetching.

If you want a primary key index to be a clustering index, a primary key should not be specified at CREATE TABLE. Once a primary key is created, the associated index cannot be modified. Instead, perform a CREATE TABLE without a primary key clause. Then issue a CREATE INDEX statement, specifying clustering attributes. Finally, use the ALTER TABLE statement to add a primary key that corresponds to the index just created. This index will be used as the primary key index.
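
Put together, the sequence looks roughly like this (the table, column, and index names are placeholders):

-- 1. Create the table without a primary key clause.
CREATE TABLE CUSTOMER (
    CUSTOMER_NUMBER INTEGER NOT NULL,
    LAST_NM         VARCHAR(30),
    FIRST_NM        VARCHAR(30)
);

-- 2. Create a unique clustering index on the future key column.
CREATE UNIQUE INDEX IX_CUST_PK ON CUSTOMER (CUSTOMER_NUMBER) CLUSTER;

-- 3. Add the primary key; DB2 reuses the existing unique index as
--    the primary key index, so the key is also the clustering key.
ALTER TABLE CUSTOMER ADD PRIMARY KEY (CUSTOMER_NUMBER);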

Generally, clustering is more effectively maintained if the clustering index is unique.