As a long-time performance DBA, I’ve often felt that it is important to know something about troubleshooting the layers that are upstream and downstream of the database in the technology stack. Lately, I’ve been using packet captures and Wireshark to solve tough issues in the TCP layer. We recently resolved a long-standing issue with TCP retransmissions that were causing connection drops between an application server and one of our databases, and I thought the write-up might help others faced with similar issues.
This problem started with a series of TNS-12535 messages that were seen in the Oracle alert logs for one of our databases:
Fatal NI connect error 12170.
. . . .
Time: 09-AUG-2015 01:01:21
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.xxx.xxx.222)(PORT=39488)
Support tickets were opened with Oracle and with the vendor of the third-party application (CA). At this point, our DBAs began a months-long foray into “vendors-point-fingers-at-each-other hell”. Eventually, they contacted our performance engineering team and I engaged with them.
First stop was the Oracle MOS site. There was one hit (Doc ID 1286376.1) that was a pretty close fit for the error messages we were seeing, but all it pointed to was that a “client connection has experienced a timeout”.
Leveraging previous experience with TCP retransmissions and working with our network team, we found that this app server and database were conversing across a networking device that was known to be problematic and was dropping packets. Since that device was end-of-life, it was scheduled for replacement. The replacement was expedited and it was expected that the connection drop issue would go away. And it did, mostly…
…but not everywhere…
We were still occasionally seeing this issue in one of our data centers but not the other. Enter packet captures and Wireshark. Because this error wasn’t occurring frequently, our network team set up packet captures in the background. When a connection drop occurred, we supplied the times, they pulled the packets and we ran them through Wireshark. We also pulled packets when we weren’t seeing errors to see what normal looked like.
A Pattern Emerges
One of the things I like best about Wireshark is its filtering ability. Once I identified the TCP stream that experienced the connection drop, I filtered the packets to just show me retransmitted packets and the RST (reset) packets. For example, for one of the events, the filter I used was:
tcp.port==1527 && tcp.stream==19 && vlan.id==144 && (tcp.analysis.retransmission || tcp.flags==0x0004)
Basically – focus on TCP stream #19 and only look at those packets that were going back and forth from the database (port 1527) on VLAN 144 and where the packet was a retransmission or a reset/RST (flag 0x0004).
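When several streams need the same treatment, the filter can be assembled programmatically. The following is a minimal sketch in Python, not the tooling we actually used: the helper names and the capture.pcap file name are placeholders, and it assumes tshark (Wireshark's command-line companion) is installed if you go on to run the command it builds.

```python
def build_display_filter(port, stream, vlan_id):
    """Compose the display filter described above for one TCP stream."""
    # Retransmissions, or packets whose flags field is exactly RST (0x0004).
    events = "(tcp.analysis.retransmission || tcp.flags==0x0004)"
    return (
        f"tcp.port=={port} && tcp.stream=={stream} "
        f"&& vlan.id=={vlan_id} && {events}"
    )


def tshark_command(pcap_path, display_filter):
    """Return the argv list for reading a saved capture with a display filter.

    -r reads a capture file; -Y applies a display filter to it.
    """
    return ["tshark", "-r", pcap_path, "-Y", display_filter]


if __name__ == "__main__":
    flt = build_display_filter(port=1527, stream=19, vlan_id=144)
    print(flt)
    # "capture.pcap" is a placeholder, not a file from the original captures.
    print(tshark_command("capture.pcap", flt))
```

One caveat worth knowing: tcp.flags==0x0004 matches only packets where RST is the sole flag set. If the resets in your capture carry ACK as well, a filter on tcp.flags.reset==1 would catch both forms.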
Here’s a blow-up of the relevant area within Wireshark:
The numbers on the left are the packet sequence numbers. Notice that just prior to the two connection resets (the “[RST]” packets in red), sequence numbers 461068 and 78839 are each retransmitted four times. We saw this pattern across several of the connection drop events: every time, there were four retransmissions and then the connection reset. Packets retransmitted three times did not experience the reset, and we did not see any packets retransmitted five times.
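Once the filtered packets are exported, the same pattern check can be done mechanically. The sketch below is illustrative only: the function name is hypothetical, and the sample list is made-up data shaped like the event described above (it reuses the two sequence numbers but is not the real export).

```python
from collections import Counter


def summarize_stream(packets, threshold=4):
    """Report heavily retransmitted packets and whether a reset occurred.

    `packets` is an ordered list of (seq, kind) tuples, where kind is
    "retransmission" or "rst". Returns (suspects, saw_reset), where
    suspects maps seq -> count for every seq retransmitted at least
    `threshold` times.
    """
    counts = Counter(seq for seq, kind in packets if kind == "retransmission")
    saw_reset = any(kind == "rst" for _, kind in packets)
    suspects = {seq: n for seq, n in counts.items() if n >= threshold}
    return suspects, saw_reset


if __name__ == "__main__":
    # Made-up sample shaped like the capture described in the text.
    sample = (
        [(461068, "retransmission")] * 4
        + [(78839, "retransmission")] * 4
        + [(0, "rst"), (0, "rst")]
    )
    print(summarize_stream(sample))
```

Running this over both the failing and the healthy captures would make the "four retransmissions, then reset" signature easy to confirm or rule out for each event.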
After Googling around a bit, I found some hits on a Linux TCP parameter called tcp_retries2, which is defined in the Linux man page for TCP as “The maximum number of times a TCP packet is retransmitted in established state before giving up”. The Linux default is 15, but we were seeing this behavior after four retransmissions.
Further investigation on the servers revealed that this parameter was actually set to 3. We found the settings by looking in two places:
/proc/sys/net/ipv4/tcp_retries2 – contains the current value of the parameter
/etc/sysctl.conf – holds the persistent kernel parameter settings that are loaded at boot (or on demand with sysctl -p)
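Checking both places can be scripted so that the running value and the persistent setting are compared side by side. This is a rough sketch under a few assumptions: the function names are my own, the parser handles only simple key = value lines, and files under /etc/sysctl.d (which some distributions also read) are not considered.

```python
from pathlib import Path


def read_running_value(name, proc_root="/proc/sys/net/ipv4"):
    """Return the live kernel value for a net.ipv4 parameter, or None if unreadable."""
    try:
        return (Path(proc_root) / name).read_text().strip()
    except OSError:
        return None


def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings.

    Blank lines and lines starting with '#' or ';' are treated as comments.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    print("running:   ", read_running_value("tcp_retries2"))
    try:
        conf = parse_sysctl_conf(Path("/etc/sysctl.conf").read_text())
    except OSError:
        conf = {}
    print("persistent:", conf.get("net.ipv4.tcp_retries2"))
```

If the two values differ, someone has changed the parameter at runtime without persisting it (or the other way around), which is worth knowing before the next reboot.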
In fact, there are several other parameters that influence TCP connections and when they time out. They include:
tcp_keepalive_intvl
tcp_keepalive_probes
tcp_keepalive_time
tcp_retries1
tcp_syn_retries
tcp_synack_retries
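To compare a tuned server against one running defaults, it helps to snapshot all of these values at once. A minimal sketch, assuming the standard /proc/sys/net/ipv4 layout; the function names are hypothetical, and the list uses tcp_retries1 and tcp_retries2, which are the names the tcp man page gives for the two retry limits.

```python
import os

TCP_TIMEOUT_PARAMS = [
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
    "tcp_keepalive_time",
    "tcp_retries1",
    "tcp_retries2",
    "tcp_syn_retries",
    "tcp_synack_retries",
]


def snapshot_tcp_params(names=TCP_TIMEOUT_PARAMS, proc_root="/proc/sys/net/ipv4"):
    """Read each parameter as an int; unreadable or missing ones map to None."""
    snapshot = {}
    for name in names:
        try:
            with open(os.path.join(proc_root, name)) as fh:
                snapshot[name] = int(fh.read().strip())
        except (OSError, ValueError):
            snapshot[name] = None
    return snapshot


def diff_snapshots(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    for name, value in snapshot_tcp_params().items():
        print(f"{name} = {value}")
```

Collecting a snapshot on each host and passing the two dicts to diff_snapshots gives a short list of exactly which settings were changed from the defaults.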
The descriptions for these can be found in the man page for TCP – here is a web-based version of the man page.
So – why was tcp_retries2 set to 3, when the default is 15? After talking with our DBA and Systems teams, we found that this and other TCP-related parameter changes were recommended by Red Hat consulting several years ago during an assessment and have been in place since this set of DB servers was built. In fact, this information still exists out on the web in this presentation. The intent was to make failed connections fail faster – however, in this case, it was causing issues by making the TCP layer less tolerant of network conditions where a packet can occasionally be retransmitted multiple times. And a certain number of TCP retransmissions is normal on a busy corporate network – TCP is a reliable protocol that ensures packet delivery via retransmissions.
Since we were not seeing issues on systems that were using the defaults, we decided to set the parameters back to their original default values. While we didn’t feel that this parameter setting addressed the central question (why are we seeing multiple retransmissions in one site but not another?), increasing its value back to the default of 15 did make the TCP layer more tolerant of multiple retransmissions and the dropped connections problem went away.
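To get a feel for how much tolerance the setting buys, the give-up time can be estimated from TCP's exponential backoff. The sketch below is a back-of-the-envelope model, not kernel code: it assumes the retransmission timeout starts at Linux's 200 ms minimum and doubles after each unacknowledged send, capped at 120 seconds. The real timeout is derived from the measured round-trip time, so these figures are lower bounds, and the function name is my own.

```python
def hypothetical_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Estimate the seconds before an established connection is abandoned.

    The original send plus `retries` retransmissions each wait one RTO,
    with the RTO doubling from rto_min up to the rto_max cap.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    for retries in (3, 15):
        seconds = hypothetical_timeout(retries)
        print(f"tcp_retries2={retries}: about {seconds:.1f} seconds")
```

Under these assumptions a value of 3 gives up after about 3 seconds, while the default of 15 works out to roughly 924.6 seconds, which matches the hypothetical timeout quoted in the kernel's ip-sysctl documentation. A few seconds of packet loss is enough to kill a connection in the first case and barely registers in the second.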