(Apologies to Mark Twain…)
I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awful strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.
The issue was presenting itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) were indicating that I/O operations were taking longer than normal. Normally when we see this, we gather data about the I/O subsystem using utilities like iostat. Since this was on an Exadata, we also used the cellcli utility to report on storage cell information. However – this time – neither of these utilities was showing long I/O waits corresponding to our issue.
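As a rough sketch, these are the kinds of checks we run in that situation (the sampling interval and the cell metric name pattern are illustrative and vary by system and Exadata version):

```shell
# Sample extended device statistics every 5 seconds, 3 times;
# long waits usually show up in the await/latency columns
iostat -x 5 3

# On an Exadata storage cell, report current cell disk I/O timing metrics
# (CD_IO_TM_* are the I/O latency metrics; exact names vary by version)
cellcli -e "LIST METRICCURRENT WHERE name LIKE 'CD_IO_TM_.*'"
```

In our case, neither view showed latencies anywhere near what the database-side wait events were claiming.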
Without corroborating I/O metrics, we had to start thinking about other resources involved in performing I/Os. Since it takes CPU cycles to perform I/Os, we started looking at that resource first. Because Exadatas have lots of CPU, it takes a lot of usage to move the needle. Complicating this was the fact that the episodes lasted less than one second, so there were no noticeable spikes in CPU.
Since this was a Linux operating system, we also started looking down at the system call level, using tools like strace and perf. Again, nothing jumped out at us.
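For illustration, the sort of system-call-level look we took goes something like this (the PID is a placeholder, not a real process from our system):

```shell
# Attach to a process for ~10 seconds and print a summary table of
# system calls, counts, and time spent in each on detach
timeout 10 strace -c -p 12345

# Live, system-wide view of the hottest kernel and user functions
perf top
```

If the time really were going to slow I/O, we would expect it to surface here as time spent in read/write-family system calls; it did not.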
One thing we noticed was that we were not seeing the same long I/O waits on another database on the same server. This indicated that the problem was probably specific to the database we were looking at and not due to the system as a whole. We also noticed that the problem was not present two weeks before on this database. Circumstantial evidence at best, but it was something.
Intuition and Past Experience
At this point, we had to start relying on past experience and intuition. We had seen strange performance issues before with databases that were not using hugepages. Hugepages can be critical to the performance of applications (like Oracle) that use large amounts of memory and have lots of concurrent processes.
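To see why hugepages matter so much at this scale, a bit of back-of-the-envelope arithmetic helps. The SGA size and process count below are illustrative, not our actual numbers:

```shell
# Page-table math for a shared memory segment mapped by many processes.
# With 4 KiB pages, every process carries its own page-table entries for
# the whole SGA; with 2 MiB hugepages the mapping is tiny by comparison.
sga_bytes=$((10 * 1024 * 1024 * 1024))   # 10 GiB SGA (assumed)
pte_bytes=8                              # one page-table entry on x86-64
procs=500                                # concurrent processes (assumed)

small=$(( sga_bytes / 4096 * pte_bytes * procs ))      # 4 KiB pages, per-process
huge=$((  sga_bytes / (2 * 1024 * 1024) * pte_bytes )) # 2 MiB hugepages

echo "4K pages:  $(( small / 1024 / 1024 )) MiB of page tables"
echo "hugepages: $(( huge / 1024 )) KiB of page tables"
```

With these made-up numbers, small pages cost roughly 10 GiB of page tables across all processes versus about 40 KiB with hugepages, and that gap is memory the kernel has to walk and manage on every TLB miss.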
We checked the operating system and noticed that there were not enough hugepages configured to cover the shared memory of both databases. The database that was OK had 100% of its shared memory allocated from hugepages, while the database that was having “I/O issues” had only 5% of its shared memory in hugepages. A closer look at the database logs showed that two weeks prior, the situation was reversed; that was the period when this DB was not having performance problems. The circumstantial evidence was starting to build…
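As a sketch of that check, assuming a 2 MiB hugepage size and using made-up numbers chosen to mirror our 5% case:

```shell
# Sample /proc/meminfo fragment (values are illustrative)
meminfo='HugePages_Total:    5120
HugePages_Free:     4864
Hugepagesize:       2048 kB'

total=$(echo "$meminfo" | awk '/HugePages_Total/ {print $2}')
free=$( echo "$meminfo" | awk '/HugePages_Free/  {print $2}')
used=$(( total - free ))                 # hugepages actually in use

sga_mb=10240                             # SGA size in MiB (assumed)
needed=$(( sga_mb / 2 ))                 # pages needed at 2 MiB each
echo "SGA backed by hugepages: $(( used * 100 / needed ))%"
```

On a live system you would read /proc/meminfo directly and compare against the instance alert log, which reports how many hugepages the instance actually obtained at startup.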
We continued looking for the “magic metric” that would validate all of this and quickly point us to this problem in the future. Unfortunately, we weren’t able to find it. The perf utility was probably our best bet; however, for some reason it was not exposing data on TLB (Translation Lookaside Buffer) misses, which would have further helped confirm the hugepages theory. More research needed here.
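For reference, this is the kind of perf invocation we were attempting (the PID is a placeholder; the dTLB events are hardware counters and, as we found, are not exposed on every CPU/kernel combination):

```shell
# Count data-TLB loads and misses for one process over 10 seconds;
# a high miss ratio on a big-SGA process supports the hugepages theory
perf stat -e dTLB-load-misses,dTLB-loads -p 12345 -- sleep 10
```

Running `perf list` first shows which hardware cache and TLB events your platform actually supports.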
We now felt there was enough evidence to schedule a change to increase the number of hugepages. We did that this past Friday and the results are looking much better. The service times have gone back to their normal levels. Root cause? A configuration shortfall that led to inefficient memory mapping, which in turn used enough extra CPU cycles that calls to the I/O subsystem took longer than usual. Simple, right?
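For completeness, a sketch of what such a change looks like; the page count and memlock values here are illustrative and must be sized to cover the combined SGAs on the server:

```shell
# /etc/sysctl.conf -- reserve enough 2 MiB pages for both SGAs (count illustrative)
vm.nr_hugepages = 12000

# /etc/security/limits.conf -- allow the oracle user to lock that much memory (KiB)
oracle soft memlock 24576000
oracle hard memlock 24576000
```

After `sysctl -p` and an instance restart, the alert log reports how many hugepages each instance obtained; Oracle’s use_large_pages=ONLY parameter can be set so an instance fails to start rather than silently falling back to small pages.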
Sometimes the data and metrics about a problem are misleading and we don’t have the luxury of direct evidence that shows us exactly where the problem is. When this happens, we have to rely on circumstantial evidence and intuition informed by past experience to get to problem resolution.