As a long-time performance DBA, I’ve often felt that it is important to know something about troubleshooting the layers that are upstream and downstream of the database in the technology stack. Lately, I’ve been making use of packet captures and Wireshark to solve tough issues in the TCP layer. We recently resolved a long-standing issue with TCP retransmissions that were causing connection drops between an application server and one of our databases, and I thought the approach might help others faced with similar issues.
This problem started with a series of TNS-12535 messages that were seen in the Oracle alert logs for one of our databases:
One area that I’ve been spending quite a bit of time looking at lately is the TCP layer on our servers. We have seen multiple issues that involve TCP and it is an oft-overlooked area when troubleshooting.
There are two tools that I’d like to focus on today – netstat and nstat. Both tools pull statistics from the following Linux files, which track network-related statistics and SNMP counters:
Here is what the output of these two files looks like:
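For readers who want to work with these counters directly, here is a minimal sketch. It assumes the files in question are /proc/net/snmp and /proc/net/netstat, which is where netstat -s and nstat normally read their counters on Linux. Both files use pairs of lines: a line of counter names followed by a line of values, each prefixed with the protocol name. The function name and the sample text are my own, and the sample numbers are made up for illustration, not taken from a real server.

```python
def parse_proc_net_counters(text):
    """Parse paired name/value lines in the /proc/net/snmp style.

    Returns a dict keyed as "Proto.CounterName", e.g. "Tcp.RetransSegs".
    """
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: counter names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, nums = values.split(":", 1)
        if proto != value_proto:
            raise ValueError(f"mismatched pair: {proto!r} vs {value_proto!r}")
        for name, num in zip(names.split(), nums.split()):
            counters[f"{proto}.{name}"] = int(num)
    return counters


# Illustrative sample only (hypothetical numbers, trimmed to four counters).
SAMPLE = """\
Tcp: ActiveOpens PassiveOpens OutSegs RetransSegs
Tcp: 120 45 100000 250
"""

counters = parse_proc_net_counters(SAMPLE)
# Share of sent segments that were retransmissions, as a percentage.
retrans_pct = 100.0 * counters["Tcp.RetransSegs"] / counters["Tcp.OutSegs"]
print(f"retransmitted segments: {retrans_pct:.2f}%")
```

On a real system you would pass in the contents of the file itself, for example `parse_proc_net_counters(open("/proc/net/snmp").read())`. The counters are cumulative since boot, so comparing two snapshots taken a known interval apart is usually more useful than a single reading.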
In my previous article on hugepages, I discussed what hugepages are and talked about the page table, the Translation Lookaside Buffer (TLB) and TLB Misses, Page Walks and Page Faults. I also discussed how using hugepages reduces the amount of memory used and also reduces the number of CPU cycles needed to do the logical-to-physical memory mapping.
In this post, I’d like to talk about how to use hugepages with the Oracle database and with JVMs. I’ll also talk about Transparent Hugepages (THP) and why you should turn off this new Linux “feature”.
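Before changing anything, it helps to know whether THP is currently active. The sketch below is one way to check, assuming a mainline kernel that exposes the setting at /sys/kernel/mm/transparent_hugepage/enabled (some older Red Hat kernels used a redhat_transparent_hugepage directory instead). The file lists the available modes and puts the active one in square brackets, for example "always madvise [never]". The function names are my own.

```python
import re
from pathlib import Path

# Assumed location on mainline kernels; adjust for your distribution.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from the file contents, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED):
    """Read the active THP mode, or None if the file is missing or unreadable."""
    try:
        return active_thp_mode(path.read_text())
    except OSError:
        return None


print("THP mode:", read_thp_mode())
```

A result of "never" means THP is off. "always" or "madvise" means it is still available to some or all processes.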
Hans and Franz
As memory becomes cheaper, servers are delivered with larger memory configurations and applications are starting to address more of it. This is generally a good thing from a performance standpoint. However, this can create performance issues when you’re using the default memory page size of 4 KB on x86-based systems.
To address this, Linux has a feature called “hugepages” that allows applications (databases, JVMs, etc.) to allocate larger memory pages than the 4 KB default. Applications using hugepages can benefit from these larger page sizes because they have a greater chance of finding memory mapping info in cache and thereby avoid more expensive operations.
In order to understand the benefits of hugepages, it helps to know a bit more about memory mapping, page tables and the TLB (translation lookaside buffer).
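To get a feel for the scale involved, here is a small back-of-the-envelope calculation. It assumes a 2 MB hugepage size (the common default on x86-64) and a hypothetical 64 GB memory region such as a large database buffer cache. Both numbers are illustrative choices, not figures from a specific system.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages required to map a region, rounding up."""
    return -(-region_bytes // page_bytes)


region = 64 * GIB                          # hypothetical memory region
small_pages = pages_needed(region, 4 * KIB)
huge_pages = pages_needed(region, 2 * MIB)

print(f"4 KB pages: {small_pages:,}")      # 16,777,216
print(f"2 MB pages: {huge_pages:,}")       # 32,768
print(f"reduction:  {small_pages // huge_pages}x")  # 512x
```

Each page needs its own mapping entry, so the same region needs 512 times fewer entries with 2 MB pages. That is why a mapping is far more likely to be found in the limited cache the TLB provides.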
(Apologies to Mark Twain…)
I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awful strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.
The issue was presenting itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) were indicating that I/O operations were taking longer than normal. Normally when we see this, we gather data about the I/O subsystem using utilities like iostat. Since this was on an Exadata, we also used the cellcli utility to report on storage cell information. However – this time – neither of these utilities was showing long I/O waits corresponding to our issue.