Diagnosing Oracle “reliable message” Waits

“Reliable message” waits are cryptic by nature.  This is a general-purpose wait event that covers many different types of channel communication within the Oracle database.  I’ve read some blogs suggesting that it is a benign wait event that can be ignored.  In my experience, these waits are not benign and should not be ignored.  This post will show you how to decipher these events and resolve the underlying issue.

Here is what you might see in an AWR report:

Continue reading

Using Wireshark to Diagnose a Connection Drop Issue in Oracle

As a long-time performance DBA, I’ve often felt that it is important to know something about troubleshooting the layers that are upstream and downstream of the database in the technology stack.  Lately, I’ve been making use of packet captures and Wireshark to solve tough issues in the TCP layer.  We recently resolved a long-standing issue with TCP retransmissions that were causing connection drops between an application server and one of our databases and I thought this might help others faced with similar issues.

This problem started with a series of TNS-12535 messages that were seen in the Oracle alert logs for one of our databases:

Continue reading

Investigating Linux Network Issues with netstat and nstat

One area that I’ve been spending quite a bit of time looking at lately is the TCP layer on our servers.  We have seen multiple issues that involve TCP and it is an oft-overlooked area when troubleshooting.

There are two tools that I’d like to focus on today – netstat and nstat.  Both tools pull statistics from the following Linux files, which track network-related statistics and SNMP counters:

/proc/net/netstat
/proc/net/snmp
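
Both files use the same paired-line layout: a header line of counter names, then a line of values with the same prefix. The snippet below is a minimal Python sketch of reading that layout, not how netstat or nstat are implemented. The function name `parse_proc_net` and the `SAMPLE` text are made up for illustration, and the sample numbers are not real measurements.

```python
from pathlib import Path


def parse_proc_net(text):
    """Parse the paired header/value lines used by /proc/net/snmp and /proc/net/netstat."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "Proto: name1 name2 ..." followed by "Proto: val1 val2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, vals = values.split(":", 1)
        if proto != value_proto:
            raise ValueError(f"mismatched pair: {proto!r} vs {value_proto!r}")
        counters[proto] = dict(zip(names.split(), (int(v) for v in vals.split())))
    return counters


# Made-up sample in the same layout (illustrative values only).
SAMPLE = """Tcp: ActiveOpens PassiveOpens RetransSegs
Tcp: 10 20 3
Udp: InDatagrams OutDatagrams
Udp: 7 8
"""

if __name__ == "__main__":
    print(parse_proc_net(SAMPLE)["Tcp"]["RetransSegs"])
    # On a Linux host, read the real files if they are present.
    for path in ("/proc/net/snmp", "/proc/net/netstat"):
        p = Path(path)
        if p.exists():
            stats = parse_proc_net(p.read_text())
            print(path, "sections:", sorted(stats))
```

The counters are cumulative since boot, so a single snapshot says little on its own. Taking two snapshots and subtracting them is usually more informative.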

Here is what the output of these two files looks like:

Continue reading

JavaOne/Oracle OpenWorld Highlights-Part Two

In my previous post, I covered the first two days of JavaOne and Oracle OpenWorld.  I’ll cover the rest of the week’s highlights in this post.

Blockchain and the Internet of Things (IoT)

This presentation was interesting in that it covered different ways that blockchain and alt-chain technologies (like “pegged sidechains”) might be used in the Internet of Things.  It also introduced one company (Blockstream) that is using pegged sidechains to extend Bitcoin and blockchains for micropayments, smart contracts and property registries.  The presentation also introduced the concept of “Sensing as a Service”, an emerging business model for the Internet of Things.

Continue reading

JavaOne/Oracle OpenWorld Highlights-Part One

I had the opportunity to attend JavaOne/Oracle OpenWorld in San Francisco this fall (along with the thundering herds below) and thought I’d share some of the highlights from my perspective.  Lots of good information and food for thought, so without any further ado…

General Themes

On the Java side, they are celebrating Java’s 20th anniversary.  There were a number of sessions devoted to the Internet of Things (IoT), and that theme was very evident in the demo/vendor grounds.

On the Oracle side, the big theme was the Oracle Cloud.  Also new and interesting was the SPARC M7 processor, which introduces “software in silicon” and promises much faster decompression, in-memory query acceleration and “silicon secured memory”.

Continue reading

Pumping Up Performance with Linux Hugepages – Part 2

In my previous article on hugepages, I discussed what hugepages are and talked about the page table, the Translation Lookaside Buffer (TLB) and TLB misses, page walks and page faults. I also discussed how using hugepages reduces the amount of memory used and also reduces the number of CPU cycles needed to do the logical-to-physical memory mapping.

In this post, I’d like to talk about how to use Hugepages with the Oracle database and with JVMs. I’ll also talk about Transparent Hugepages (THP) and why you should turn off this new Linux “feature”.
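
Before changing anything, it helps to know which THP mode a server is currently running. Here is a minimal Python sketch, assuming the mainline kernel control file at `/sys/kernel/mm/transparent_hugepage/enabled` (some distributions use a different path). The helper name `active_thp_mode` is made up for illustration. The file lists the available modes and marks the active one with square brackets.

```python
from pathlib import Path

# Assumed location of the THP control file on mainline kernels.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from text such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP control file not found at", THP_ENABLED)
```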

Continue reading

Pumping Up Performance with Linux Hugepages – Part 1

Hans and Franz

As memory becomes cheaper, servers are delivered with larger memory configurations and applications are starting to address more of it. This is generally a good thing from a performance standpoint. However, this can create performance issues when you’re using the default memory page size of 4 KB on x86-based systems.

 

To address this, Linux has a feature called “hugepages” that allows applications (databases, JVMs, etc.) to allocate larger memory pages than the 4 KB default. Applications using hugepages can benefit from these larger page sizes because they have a greater chance of finding memory mapping info in cache and thereby avoid more expensive operations.
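
A quick bit of arithmetic shows the scale of the difference. This Python sketch counts how many pages (and therefore how many mappings) are needed to cover a memory region at each page size. The 64 GiB region is a hypothetical figure, and the 2 MiB hugepage size is an assumption (a common size on x86-64), not a value taken from this article.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so mappings) needed to cover a memory region."""
    return -(-region_bytes // page_bytes)  # ceiling division


region = 64 * GIB                              # hypothetical memory region
small_pages = pages_needed(region, 4 * KIB)    # default 4 KB pages
huge_pages = pages_needed(region, 2 * MIB)     # assumed 2 MiB hugepages

print(f"4 KiB pages: {small_pages:,}")                   # 16,777,216
print(f"2 MiB pages: {huge_pages:,}")                    # 32,768
print(f"reduction:   {small_pages // huge_pages}x")      # 512x
```

Fewer mappings mean each cached mapping covers far more memory, which is why lookups are more likely to be satisfied from cache.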

In order to understand the benefits of hugepages, it helps to know a bit more about memory mapping, page tables and the TLB (translation lookaside buffer).

Continue reading

Lies, Damned Lies and Metrics

(Apologies to Mark Twain…)

I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awfully strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.

The issue was presenting itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) were indicating that I/O operations were taking longer than normal. Normally when we see this, we gather data about the I/O subsystem using utilities like iostat. Since this was on an Exadata, we also used the cellcli utility to report on storage cell information. However – this time – neither of these utilities was showing long I/O waits corresponding to our issue.

Continue reading

Red Flags – How a Data Quality Issue Becomes a Performance Issue

When I’m reviewing the query performance of our applications, one of the SQL constructs that raises a red flag for me is the use of the LOWER and UPPER functions in the WHERE clause (as well as TRIM/LTRIM/RTRIM).

These functions trigger a red flag because they are typically used as a workaround for a data quality problem, and this workaround usually causes downstream performance impacts. Developers use these functions when they don’t trust the format of the data in the column or when they don’t trust the format of the value that is being compared.

How does this create a downstream performance problem?  Wrapping an indexed column in a function prevents the optimizer from using a regular index on that column (unless a matching function-based index exists) and often causes a full table scan to occur. This can take a query that would normally execute in less than a millisecond and cause it to take much longer, depending on how big the table is that needs to be full-scanned.
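
The effect is easy to reproduce in miniature. The Python sketch below uses SQLite's `EXPLAIN QUERY PLAN` rather than Oracle, so it only illustrates the general behavior and is not an Oracle execution plan. The table `customers`, the index `customers_email_idx` and the helper `query_plan` are made-up names for this illustration.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX customers_email_idx ON customers (email)")

# Comparing the bare column lets the optimizer search the index.
plain = query_plan(conn, "SELECT * FROM customers WHERE email = 'a@example.com'")

# Wrapping the column in LOWER() hides it from the plain index.
wrapped = query_plan(conn, "SELECT * FROM customers WHERE LOWER(email) = 'a@example.com'")

print("plain:  ", plain)
print("wrapped:", wrapped)
```

The first plan should mention the index, while the second should fall back to scanning the table. The exact wording of the plan text varies between SQLite versions.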

Here is an example from a recently reviewed application:

Continue reading