SecureFile LOB Performance Issue

I wanted to document an issue that we experienced in one of our databases after upgrading to Oracle 12c (12.1.0.2).  The issue was that occasionally an update to one of our tables would take much longer to execute than normal.  Unfortunately, this issue did not get caught in testing and made its way into production.  It didn’t happen every time the update occurred and for this reason, it was difficult to track down.

Here’s how I approached the troubleshooting of this issue and the bug that we identified.

Since this wasn’t happening every time but we knew which SQL statement was involved, I elected to do a SQL_ID trace:

alter system set events 'sql_trace [sql:sql_id=89hm7xzttp9c9] plan_stat=all_executions,wait=true,bind=false';

After setting the trace, I monitored the trace directory waiting to find the long running UPDATEs.  To do this, I grep’d the trace files looking for EXEC operations on that SQL_ID with long elapsed time (“e=”) values.  The longest one was just over 89 seconds:

find . -type f -mtime -1 -name "*.trc" -exec grep -ln "89hm7xzttp9c9" {} \; |xargs egrep "EXEC" |cut -d"," -f2|cut -d"=" -f2|sort -n
6134275
6406358
53427796
78365426
89383890

Once I got several trace files, I turned the SQL_ID trace off:

alter system set events 'sql_trace [sql:sql_id=89hm7xzttp9c9] off';

I then pulled the trace files that had the longest EXEC times.  Here is the pattern I saw in the various trace files:

 PARSING IN CURSOR . . .
 UPDATE  x SET . . .
 . . .
 RETURNING  INTO :dl0
 END OF STMT
 BINDS #140344624162424:
. . .
 WAIT #140344624162424: nam='db file sequential read' ela= 634 file#=107 block#=3220853 blocks=1 obj#=161211 tim=2967436896291
 WAIT #140344624162424: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896532
 EXEC #140344624162424:c=3000,e=3721,p=1,cr=4,cu=23,mis=0,r=1,dep=0,og=1,plh=2409839646,tim=2967436896563
 WAIT #140344624162424: nam='SQL*Net message from client' ela= 88 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896739
 LOBPGSIZE: type=PERSISTENT LOB,bytes=8060,c=0,e=24,p=0,cr=0,cu=0,tim=2967436896797
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896823
 WAIT #0: nam='SQL*Net message from client' ela= 78 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896918
 WAIT #140344624822120: nam='SQL*Net more data from client' ela= 13 driver id=1650815232 #bytes=222 p3=0 obj#=87726 tim=2967436897091
 WAIT #140344624822120: nam='db file sequential read' ela= 350 file#=124 block#=212669 blocks=1 obj#=87726 tim=2967436897811
 WAIT #140344624822120: nam='db file sequential read' ela= 329 file#=124 block#=212670 blocks=1 obj#=87726 tim=2967436898239
 WAIT #140344624822120: nam='db file sequential read' ela= 336 file#=124 block#=212671 blocks=1 obj#=87726 tim=2967436898675

. . .  < 196000 WAITS !!! > . . .

 WAIT #140344624822120: nam='gc current grant 2-way' ela= 238 p1=15 p2=22656 p3=33554444 obj#=87726 tim=2967520128096
 WAIT #140344624822120: nam='gc current grant 2-way' ela= 208 p1=15 p2=22657 p3=33554433 obj#=87726 tim=2967520129038
 WAIT #140344624822120: nam='gc current grant 2-way' ela= 189 p1=15 p2=22658 p3=33554433 obj#=87726 tim=2967520129359
 LOBWRITE: type=PERSISTENT LOB,bytes=6468,c=33450915,e=83232687,p=133280,cr=419937,cu=274972,tim=2967520129628
 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967520129713
 WAIT #0: nam='SQL*Net message from client' ela= 608 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967520130349

Between the LOBPGSIZE operation and the LOBWRITE operation, there were over 196,000 wait events, most of them “db file sequential read” and “gc cr disk read”.  The LOBWRITE operation took about 83 seconds.

I wanted to see if there were any particular blocks that were being visited as part of this operation so I grep’d and counted the various occurrences of the “sequential read” and “gc cr disk read” operations.   What I noticed is that the same blocks were being visited over and over again – some of them as many as 681 times:

$ grep "sequential read" pdb2_ora_76294.trc|cut -d" " -f9,10,12|sort|uniq -c|sort -rn|more
    681 file#=28 block#=1793355 obj#=96666
    681 file#=28 block#=1793354 obj#=96666
    681 file#=28 block#=1793353 obj#=96666
. . .
    680 file#=28 block#=1620571 obj#=96660
    680 file#=28 block#=1620570 obj#=96660
    680 file#=28 block#=1620569 obj#=96660
. . .
    320 file#=28 block#=3592814 obj#=96643
    320 file#=28 block#=3592813 obj#=96643
    320 file#=28 block#=3592812 obj#=96643
. . .

$ grep "gc cr disk read" pdb2_ora_76294.trc|cut -d" " -f9,10,12|sort|uniq -c|sort -rn|more
    681 p1=28 p2=1793355 obj#=96666
    681 p1=28 p2=1793354 obj#=96666
    681 p1=28 p2=1793353 obj#=96666
. . .
    680 p1=28 p2=1620570 obj#=96660
    680 p1=28 p2=1620569 obj#=96660
    680 p1=28 p2=1620568 obj#=96660
. . .
    320 p1=28 p2=3592814 obj#=96643
    320 p1=28 p2=3592813 obj#=96643
    320 p1=28 p2=3592812 obj#=96643
. . .

Here is what a “normal” execution of this UPDATE looked like.  The normal LOBWRITE took only 680 microseconds:

 PARSING IN CURSOR . . .
 UPDATE   set 
 RETURNING  INTO :dl0
 END OF STMT
 BINDS #140344626255832:
. . .
 WAIT #140344626255832: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169512
 EXEC #140344626255832:c=3000,e=4262,p=0,cr=5,cu=17,mis=0,r=1,dep=0,og=1,plh=2409839646,tim=2967434169557
 WAIT #140344626255832: nam='SQL*Net message from client' ela= 76 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169722
 LOBPGSIZE: type=PERSISTENT LOB,bytes=8060,c=0,e=25,p=0,cr=0,cu=0,tim=2967434169813
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169845
 WAIT #0: nam='SQL*Net message from client' ela= 76 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169938
 WAIT #140344635454240: nam='SQL*Net more data from client' ela= 12 driver id=1650815232 #bytes=222 p3=0 obj#=-1 tim=2967434170178
 LOBWRITE: type=PERSISTENT LOB,bytes=5358,c=0,e=680,p=0,cr=2,cu=8,tim=2967434170670
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434170717
 WAIT #0: nam='SQL*Net message from client' ela= 254 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434170990

Believing this to be a bug, we then opened an SR and came to find out that we were experiencing bug # 22905136 – “SECUREFILE LOB HIGH SINGLE BLOCK PHYSICAL READ”.  The note for this bug is MOS Doc ID 2189248.1.

Applying the patch for this bug fixed the issue.

So if your application makes use of Securefile LOBs and you are upgrading to 12.1, check out this bug and make sure that this patch is applied before going into test and production.

“Result Cache: Channel” Reliable Message Waits

In this earlier post, I wrote about “reliable message” waits in Oracle and how to find out more about the channel that is being waited on.  One of these waits that we’ve been seeing with some frequency is on “Result Cache: Channel”.  In some cases, these waits have been debilitating and I believe it merits its own post, in case someone else runs into this issue.

Here is an example of one issue we experienced in 11.2.0.3.  Users were reporting a slow system and in looking at session waits, we saw long wait times for the “reliable message” event. In querying GV$CHANNEL_WAITS, we saw the following waits on instance 3 of a RAC database:

select inst_id, channel, messages_published, wait_count, WAIT_TIME_USEC/1000000 wait_time_sec
from GV$CHANNEL_WAITS
order by inst_id, wait_time_sec desc;
 INST
   ID CHANNEL                                                MESSAGES_PUBLISHED WAIT_COUNT WAIT_TIME_SEC
----- ------------------------------------------------------ ------------------ ---------- -------------
. . .
    3 Result Cache: Channel                                           474745796  473767959    267825.106
    3 obj broadcast channel                                             4928144    2643616      5065.185
    3 kxfp control signal channel                                       3349090    1140344       484.500
    3 MMON remote action broadcast channel                               401055       6999       266.940
    3 RBR channel                                                        320568      37809       137.223
. . .

Continue reading

Diagnosing Oracle “reliable message” Waits

Reliable Message waits are cryptic by nature.  It is a general purpose wait event that tracks many different types of channel communications within the Oracle database.  I’ve read some blogs that suggest that this is a benign wait event that can be ignored.  My experience is that they are not benign and should not be ignored.  This post will show you how to decipher these events and resolve the issue.

Here is what you might see in an AWR report:

Continue reading

Using Wireshark to Diagnose a Connection Drop Issue in Oracle

As a long-time performance DBA, I’ve often felt that it is important to know something about troubleshooting the layers that are upstream and downstream of the database in the technology stack.  Lately, I’ve been making use of packet captures and Wireshark to solve tough issues in the TCP layer.  We recently resolved a long-standing issue with TCP retransmissions that were causing connection drops between an application server and one of our databases and I thought this might help others faced with similar issues.

This problem started with a series of TNS-12535 messages that were seen in the Oracle alert logs for one of our databases:

Continue reading

JavaOne/Oracle OpenWorld Highlights-Part Two

In my previous post, I covered the first two days of JavaOne and Oracle OpenWorld.  I’ll cover the rest of the week’s highlights in this post.

Blockchain and the Internet of Things (IoT)

This presentation was interesting in that it talked about different ways that blockchain and alt-chain technologies (like “pegged sidechains”) might be used in the Internet of Things.  It also introduces one company (Blockstream) who is using pegged sidechains to extend Bitcoin and blockchains for micropayments, smart contracts and property registries.  This paper also introduces the concept of “Sensing as a Service” – an emerging business model for the Internet of Things.

Continue reading

JavaOne/Oracle OpenWorld Highlights-Part One

I had the opportunity to attend JavaOne/Oracle OpenWorld in San Francisco this fall (along with the thundering herds below) and thought I’d share some of the highlights from my perspective.  Lots of good information and food for thought, so without any further ado…

General ThemesoptimizedSwarm

On the Java side, they are celebrating the 20th anniversary.  There were a number of sessions devoted to the Internet of Things (IoT) and that was very evident in the demo/vendor grounds.

On the Oracle side, the big theme was the Oracle Cloud.  Also interesting and new was the introduction of the new Sparc M7 processor, which introduces “software in silicon” and promises to offer much faster decompression, in-memory query acceleration and “silicon secured memory”.

Continue reading

Pumping Up Performance with Linux Hugepages – Part 2

optimizedhuge_book

In my previous article on hugepages, I discussed what hugepages are and talked about the page table, the Translation Lookaside Buffer (TLB) and TLB Misses, Page Walks and Page Faults. I also discussed how using hugepages reduces the amount of memory used and the also reduces the number of CPU cycles needed to do the logical to physical memory mapping.

In this post, I’d like to talk about how to use Hugepages with the Oracle database and with JVMs. I’ll also talk about Transparent Hugepages (THP) and why you should turn off this new Linux “feature”.

Continue reading

Lies, Damned Lies and Metrics

(Apologies to Mark Twain…)

I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awful strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.

The issue was presenting itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) were indicating that I/O operations were taking longer than normal. Normally when we see this, we gather data about the I/O subsystem using utilities like iostat. Since this was on an Exadata, we also used the cellcli utility to report on storage cell information. However – this time – neither of these utilities was showing long I/O waits corresponding to our issue.

Continue reading