(Apologies to Mark Twain…)
I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awful strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.
The issue presented itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) indicated that I/O operations were taking longer than normal. When we see this, we usually gather data about the I/O subsystem using utilities like iostat. Since this database ran on Exadata, we also used the cellcli utility to report on the storage cells. This time, however, neither utility showed long I/O waits corresponding to our issue.
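To make that cross-check concrete, here is a minimal Python sketch of the comparison. It is an illustration, not the tooling we actually used. It assumes you have already reduced both data sources to one average latency per time interval. The `Sample` type, the `find_mismatches` function, the 20 ms thresholds, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: str            # interval label, e.g. "10:01"
    db_wait_ms: float      # average I/O wait as seen by database sessions
    device_wait_ms: float  # average I/O wait as reported by the storage layer


def find_mismatches(samples, db_threshold_ms=20.0, device_threshold_ms=20.0):
    """Return intervals where the database reports slow I/O but the storage layer does not."""
    return [
        s.minute
        for s in samples
        if s.db_wait_ms > db_threshold_ms and s.device_wait_ms <= device_threshold_ms
    ]


# Made-up numbers, for illustration only.
samples = [
    Sample("10:00", 4.0, 3.5),    # both healthy
    Sample("10:01", 85.0, 4.1),   # database slow, storage looks fine
    Sample("10:02", 90.0, 70.0),  # both slow: a real storage problem
]

mismatches = find_mismatches(samples)
print(mismatches)
```

An interval like the second one is the kind of disagreement described above: the database sessions report long waits while the storage-side numbers look normal.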