Red Flags – How a Data Quality Issue Becomes a Performance Issue

When I’m reviewing the query performance of our applications, one of SQL constructs that raises a red flag for me is the use of LOWER and UPPER functions in the WHERE clause (as well as TRIM/LTRIM/RTRIM).

The reason these functions trigger a red flag is because they are typically used as a workaround to a data quality problem and this workaround usually causes downstream performance impacts. Developers use these functions when they don’t trust the format of the data in the column or when they don’t trust the format of the value that is being compared.

How does this create a downstream performance problem?  Wrapping an indexed column in a function disables the use of the index and often causes a full table scan to occur. This can take a query that would normally execute in less than a millisecond and cause it to take much longer – depending on how big the table is that needs to be full-scanned.

Here is an example from a recently reviewed application:

select * from
 (select user_id, org_id, company_name_text,
    crte_by, crte_dttm, dept_cd, first_nam, last_login_datetime, lst_nam, lst_updt_dttm,
    stat_cd, token_serial_number, user_alias
  from USER_INFO
  where upper(user_id)=upper(:1 )
where rownum <= :2

| Id  | Operation          | Name      |
|   0 | SELECT STATEMENT   |           |
|*  1 |  COUNT STOPKEY     |           |

As you can see, the execution plan for this query is doing a full table scan. When I ran a test of this, the query took 78 milliseconds and did over 2500 block reads to scan the table. (This time will only increase as the table gets larger).

When I changed the query to remove the UPPER functions, it took only 70 micro-seconds and did only 3 block reads, because it was able to use the index. A 1000x+ improvement.

I then took a look at the USER_ID column for this table. Out of 120,000 rows, 99.99% have USER_ID values that are all lower case. Only 11 rows have a USER_ID value that is a mix of upper and lower case and some of those rows look suspicious.

For the sake of those 11 rows, we have full-scanned this table over 13 million times in the past month, by applying the UPPER function – and that is just for this one query. There are other queries in this application that do the same thing.

What about a Function-Based Index?

We could create a function-based index on the expression “UPPER(user_id)” and get the benefit of indexed access. However, this puts us in the situation of applying a workaround to address the negative effects of a workaround that was applied to address bad data quality. It also means that we now have another index to maintain every time a row is inserted/updated/deleted.

In the long run, it is better to address the data quality issues and avoid the workarounds.

How do we avoid this situation?

  • we can evaluate the data to see if we really need to convert it to a particular case in order to be able to query / filter on it.
  • if our business rules say that the data has to be stored in all upper- or lower-case (or in any other format), then we should convert it once at the time we insert or update that data – not every time we query the data. To put it another way – data cleansing should occur before the data is persisted to the database.
  • if we know we have bad data in our tables, we should take the time to fix that data (and whatever process allowed that bad data in) so that we don’t have to put performance-killing workarounds in our SQL.

By paying attention to the quality of our data, we can avoid the workarounds that contribute to poor query performance.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s